Replacing your existing chat/completions backend
Move an app from a synchronous /chat/completions call to LoreOS’s async event model — without a UX downgrade.
If your app already calls a POST /chat/completions-style endpoint and renders the
returned text, this page is the migration. The shape changes in exactly one place: the
reply is no longer in the HTTP response. Everything else — your UI, your end-user auth,
your message history — stays yours.
This guide gives you a drop-in adapter that hides the change behind a sync-feeling function, plus the production patterns (retry safety, typing UX, polling-vs-SSE, and a graceful fallback to your old provider during cutover).
The one mental shift
A classic completions call is synchronous: you send messages, the HTTP response is the reply.
LoreOS is asynchronous: POST /v1/sessions/{id}/messages returns immediately with a
cursor and a run_ref — an acknowledgement, not the reply. The character’s reply
arrives a few seconds later as a message.created event on the session’s event log, which
you read by polling, SSE, or webhook.
Why it works this way. A LoreOS reply is not a single model call. The character runs a multi-step engine per turn — retrieval over its evolving world model, relational and emotional state, grounding and safety/accuracy gates, voice shaping, then multi-bubble emission and delivery. That work takes seconds and can be recovered if your process dies mid-turn (a durability ledger re-runs it). Returning a cursor immediately lets you ack the send instantly, show a typing indicator, and read a richer multi-bubble reply when it lands — instead of holding an HTTP request open through the whole engine.
Two consequences to internalize before you write code:
- The reply is multi-bubble. payload.bubbles is an ordered list (a messenger-style burst); payload.text is the joined convenience form. Render bubbles when present.
- Don’t read top-level response fields. Every response is wrapped as { schema_version, data, next_actions }. Read response.data.cursor, not response.cursor. Errors are { detail: { code, message, fix } }.
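A minimal sketch of unwrapping that envelope in TypeScript. Only the envelope and error shapes come from the text above; the names unwrap and LoreosError are illustrative and not part of any SDK.

```typescript
// Error envelope as described above: { detail: { code, message, fix } }.
interface ErrorBody {
  detail: { code: string; message: string; fix?: string };
}

// Illustrative error type (hypothetical name) carrying the envelope fields.
class LoreosError extends Error {
  readonly status: number;
  readonly code: string;
  readonly fix: string | undefined;
  constructor(status: number, detail: ErrorBody["detail"]) {
    super(detail.message);
    this.name = "LoreosError";
    this.status = status;
    this.code = detail.code;
    this.fix = detail.fix;
  }
}

function isErrorBody(body: unknown): body is ErrorBody {
  if (typeof body !== "object" || body === null) return false;
  const detail = (body as { detail?: unknown }).detail;
  return typeof detail === "object" && detail !== null;
}

// Returns the `data` member of { schema_version, data, next_actions },
// or throws when the body is the { detail } error envelope.
function unwrap<T>(status: number, body: unknown): T {
  if (isErrorBody(body)) throw new LoreosError(status, body.detail);
  if (typeof body !== "object" || body === null || !("data" in body)) {
    throw new Error(`Unexpected response shape (HTTP ${status})`);
  }
  return (body as { data: T }).data;
}
```

Call sites then read unwrap(...).cursor rather than reaching into the raw response.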
A reference adapter (sync façade over the async events)
The cleanest migration keeps your call sites looking synchronous. Write one server-side
function — getReply(sessionId, text) — that sends the message, waits for the
message.created event, and returns the bubbles. Your existing code calls it the same way
it called chat/completions; the async model is hidden inside.
This matches the examples/loreos-node-chat style: a small loreos() helper that unwraps
data and throws on the { detail } error envelope, plus a cursor-poll loop.
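The sketch below shows one way such an adapter could look. It is not the code from examples/loreos-node-chat. Several details are assumptions, because this guide does not specify them: the events path (/v1/sessions/{id}/events?after=…), the data.events list, the text body field, bearer-token auth, and a numeric cursor. The HTTP function is injected so the loop can be exercised without a network; Node 18+'s global fetch can be passed in its place.

```typescript
// Minimal HTTP function type; the global fetch satisfies it.
type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string },
) => Promise<{ status: number; json(): Promise<unknown> }>;

// ASSUMED event shape: the guide names cursor, payload.bubbles, payload.text
// and the "character" role, but not the exact nesting.
interface LoreEvent {
  type: string;
  cursor: number;
  payload: { role?: string; bubbles?: string[]; text?: string };
}

interface AdapterConfig {
  baseUrl: string; // your LoreOS API origin (not specified in this guide)
  apiKey: string; // LOREOS_KEY, server-side only
  http: HttpFn;
  pollMs?: number; // delay between polls, default 500 ms
  timeoutMs?: number; // give up after this long, default 30 s
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Unwraps { schema_version, data, next_actions } and throws on { detail }.
async function loreos<T>(cfg: AdapterConfig, method: string, path: string, body?: unknown): Promise<T> {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${cfg.apiKey}`, // ASSUMPTION: bearer-token auth
  };
  const init = body === undefined
    ? { method, headers }
    : { method, headers, body: JSON.stringify(body) };
  const res = await cfg.http(cfg.baseUrl + path, init);
  const json = ((await res.json()) ?? {}) as {
    data?: T;
    detail?: { code: string; message: string };
  };
  if (json.detail) throw new Error(`${json.detail.code}: ${json.detail.message}`);
  if (json.data === undefined) throw new Error(`Unexpected response (HTTP ${res.status})`);
  return json.data;
}

async function getReply(cfg: AdapterConfig, sessionId: string, text: string): Promise<string[]> {
  // 1. Send. The response is an acknowledgement (cursor + run_ref), not the reply.
  const sent = await loreos<{ cursor: number; run_ref: string }>(
    cfg, "POST", `/v1/sessions/${sessionId}/messages`,
    { text }, // ASSUMPTION: request body field name
  );

  // 2. Poll the event log from the returned cursor until the character replies.
  let after = sent.cursor;
  const deadline = Date.now() + (cfg.timeoutMs ?? 30_000);
  while (Date.now() < deadline) {
    const page = await loreos<{ events: LoreEvent[] }>(
      cfg, "GET", `/v1/sessions/${sessionId}/events?after=${after}`, // HYPOTHETICAL path
    );
    for (const ev of page.events) {
      after = Math.max(after, ev.cursor);
      if (ev.type === "message.created" && ev.payload.role === "character") {
        const { bubbles, text: joined } = ev.payload;
        // Prefer the ordered bubbles; fall back to the joined text.
        return bubbles && bubbles.length > 0 ? bubbles : [joined ?? ""];
      }
    }
    await sleep(cfg.pollMs ?? 500);
  }
  throw new Error("Timed out waiting for message.created");
}
```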
Your call site barely changes:
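As an illustration only, a call site could look like the sketch below. The GetReply type mirrors the getReply(sessionId, text) signature described above. The handleUserMessage function, the ChatMessage type, and the commented-out completions call standing in for your old provider are all hypothetical.

```typescript
// Signature of the adapter described above; the implementation is injected here.
type GetReply = (sessionId: string, text: string) => Promise<string[]>;

interface ChatMessage {
  role: "user" | "character";
  text: string;
}

async function handleUserMessage(
  getReply: GetReply,
  sessionId: string,
  history: ChatMessage[],
  userText: string,
): Promise<ChatMessage[]> {
  // Before (hypothetical old provider call):
  //   const reply = await completions({ messages: [...history, userMessage] });
  // After: one awaited call, but the result is a list of bubbles.
  const bubbles = await getReply(sessionId, userText);
  return [
    ...history,
    { role: "user", text: userText },
    // Each bubble becomes its own chat message, in order.
    ...bubbles.map((text): ChatMessage => ({ role: "character", text })),
  ];
}
```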
The session is created once per end-user (POST /v1/sessions with your
external_user_ref) and reused for the conversation — it is the persistent thread, not a
per-message construct. See Quickstart for character + session creation.
reply_mode: keep messenger-grade latency
POST /messages accepts an optional reply_mode:
- fast (default): use this for migrations. It is the same light path the managed Telegram channels run, tuned for messenger-grade latency. It skips only the advisory voice-quality critic (a final stylistic polish pass). Grounding, safety, knowledge, and the emission gates all stay on, so factual accuracy and safety are unchanged; you are not trading correctness for speed.
- deep: opts into the full critic stack, which adds the quality critic and its one-shot voice rewrite for maximum voice polish. It is roughly 17s slower per turn.
If you omit reply_mode, you get fast. The deep world-model and relational update runs
asynchronously after the reply either way — fast does not skip the character
learning from the turn, it only skips the synchronous voice-polish pass. The response
echoes the reply_mode it used, so you can confirm it.
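A small sketch of setting and confirming the mode. The text body field and the location of the echoed value (data.reply_mode) are assumptions; this guide states only that reply_mode is optional on the request and echoed in the response.

```typescript
type ReplyMode = "fast" | "deep";

// Builds the send body. Omitting reply_mode is equivalent to "fast".
function buildSendBody(text: string, replyMode?: ReplyMode): { text: string; reply_mode?: ReplyMode } {
  return replyMode ? { text, reply_mode: replyMode } : { text };
}

// Checks the echoed mode on the unwrapped `data` object (ASSUMED location).
function usedReplyMode(data: { reply_mode?: string }, expected: ReplyMode = "fast"): boolean {
  return data.reply_mode === expected;
}
```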
Retry safety with Idempotency-Key
Serverless platforms (Vercel, Lambda) and network proxies retry requests. Without
protection, a retried send creates a duplicate turn and a duplicate reply. Send an
Idempotency-Key header with a value that is stable per logical message (your own message
id is ideal):
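A minimal sketch, assuming bearer-token auth (not specified in this guide). The function names sendHeaders and sendWithRetry are illustrative. The key is derived once from your message id and reused on every attempt, so retries are recognizable as the same logical send.

```typescript
// Builds request headers with a key that is stable per logical message.
function sendHeaders(apiKey: string, messageId: string): Record<string, string> {
  return {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`, // ASSUMPTION: bearer-token auth
    "Idempotency-Key": messageId,
  };
}

// Retries a send with the SAME headers, so every attempt carries the same key.
async function sendWithRetry<T>(
  send: (headers: Record<string, string>) => Promise<T>,
  headers: Record<string, string>,
  attempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await send(headers);
    } catch (err) {
      lastError = err; // e.g. a network error; try again with the same key
    }
  }
  throw lastError;
}
```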
A same-key re-call returns the original cursor and sent_turn_index plus
idempotent_replay: true, instead of creating a second turn. Your adapter can treat that
identically — it polls from the same cursor and gets the same reply.
Dedupe on the read side too: every event carries a monotonic cursor. Track the highest
cursor you have processed and ignore anything at or below it, so a re-poll or an
at-least-once webhook never double-renders a bubble.
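The high-water-mark rule above can be sketched as a pure function. It assumes the cursor is numeric (the guide says only that it is monotonic), and the name takeNew is illustrative.

```typescript
// Returns only events above the high-water mark, in cursor order,
// together with the new mark to persist for the next batch.
function takeNew<E extends { cursor: number }>(
  events: E[],
  highWater: number,
): { fresh: E[]; highWater: number } {
  const fresh: E[] = [];
  let mark = highWater;
  for (const ev of [...events].sort((a, b) => a.cursor - b.cursor)) {
    if (ev.cursor <= mark) continue; // already processed, or repeated in this batch
    fresh.push(ev);
    mark = ev.cursor;
  }
  return { fresh, highWater: mark };
}
```

Persist the returned mark per session, so that a re-poll or a redelivered webhook after a restart is still filtered.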
Typing UX off the run.status event
The instant LoreOS accepts your send, it emits a run.status event so your UI can show a
typing indicator while the engine works.
Render “typing…” when you see run.status, and clear it on the next
message.created (role character). This is an additive event type — if you only handle
message.created, you simply won’t show typing; nothing breaks. The run_ref in the
payload matches the run_ref returned by your send, so you can scope the indicator to the
exact turn.
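One way to express that rule is a small reducer over the event stream. The event shape here is an assumption (only run_ref in the run.status payload and the character role are named above), and nextTyping is an illustrative name.

```typescript
// ASSUMED minimal event shape; unknown event types pass through untouched.
interface ChatEvent {
  type: string;
  cursor: number;
  payload: { run_ref?: string; role?: string };
}

// Returns the next "is typing" state for the turn identified by activeRunRef.
function nextTyping(typing: boolean, ev: ChatEvent, activeRunRef: string): boolean {
  if (ev.type === "run.status") {
    // Only the run started by our own send turns the indicator on.
    return ev.payload.run_ref === activeRunRef ? true : typing;
  }
  if (ev.type === "message.created" && ev.payload.role === "character") {
    return false; // the reply landed; clear the indicator
  }
  return typing; // additive or unrelated events leave the state as-is
}
```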
Polling vs SSE vs webhook: which transport
All three transports project the same cursored event log, so you can mix them and resume across them with one cursor. Pick by surface, using this decision guide:
- Migrating a request/response chat app? Start with polling inside the adapter. It is the least infrastructure and matches the synchronous call site you are replacing.
- Building a live, always-open chat surface? Use SSE for lower latency, and reconnect-with-cursor on the 5-minute cap.
- No user-facing client at all (a bot, a pipeline, a backend bridge)? Use a webhook so LoreOS pushes to your server; you don’t hold connections open.
When in doubt, polling is never wrong — the other two are optimizations over the same log.
Graceful fallback during cutover
While you migrate, keep your old LLM provider wired as a fallback so a slow or unavailable LoreOS turn degrades instead of failing the user. Wrap the adapter with a timeout and fall back to your existing completions call:
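A minimal sketch of that wrapper. The primary and fallback functions are injected, so nothing here depends on a specific provider. It assumes your adapter's errors expose the envelope code as a code property, which is an illustrative convention and not a documented one.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// ASSUMPTION: adapter errors carry the envelope's `code`.
const isBudgetError = (err: unknown): boolean =>
  typeof err === "object" && err !== null &&
  (err as { code?: unknown }).code === "budget_exceeded";

async function replyWithFallback(
  primary: () => Promise<string[]>, // the LoreOS adapter call
  fallback: () => Promise<string>, // your existing completions call
  timeoutMs: number,
): Promise<{ bubbles: string[]; source: "loreos" | "fallback" }> {
  try {
    return { bubbles: await withTimeout(primary(), timeoutMs), source: "loreos" };
  } catch (err) {
    // A budget signal is not an outage: surface it instead of falling back.
    if (isBudgetError(err)) throw err;
    return { bubbles: [await fallback()], source: "fallback" };
  }
}
```

Tagging the result with its source makes it easy to log how often the fallback fires during cutover.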
Notes on doing this safely:
- Always pass the Idempotency-Key so a fallback that races a slow-but-successful LoreOS turn does not create a duplicate when you retry later.
- A 402 budget_exceeded is a budget signal, not an outage; it is returned before any model spend. Treat it distinctly (raise the cap or back off), not as a reason to fall back to your old provider.
- Fallback replies don’t carry LoreOS memory or world-model continuity. Keep the fallback window short and prefer raising your timeout or using fast mode over leaning on it.
Vercel / Next.js: keep the key server-side
LOREOS_KEY can create state and trigger metered model work, so it must never reach the
browser. On Next.js, put the adapter behind a Route Handler (app/api/.../route.ts)
or a Server Action, set LOREOS_KEY as a server-only environment variable, and have
the browser call your own endpoint — which proxies to LoreOS.
The browser does fetch("/api/chat", …); your route handler holds the key and talks to
LoreOS. The same rule applies to the SSE and webhook transports — proxy them through your
backend, or terminate them server-side.
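A sketch of such a proxy handler using only the standard Request and Response objects. The names makeChatRoute and resolveSessionId, and the { text, messageId } request body, are illustrative choices and not a LoreOS or Next.js convention. The session id is resolved on the server from your own end-user auth rather than trusted from the browser.

```typescript
interface ChatRouteDeps {
  // Maps the authenticated end-user on this request to their LoreOS session id.
  resolveSessionId: (request: Request) => Promise<string | null>;
  // Server-side adapter; the only code that ever sees LOREOS_KEY.
  getReply: (sessionId: string, text: string, messageId: string) => Promise<string[]>;
}

function jsonResponse(status: number, body: unknown): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

function makeChatRoute(deps: ChatRouteDeps): (request: Request) => Promise<Response> {
  return async (request) => {
    const sessionId = await deps.resolveSessionId(request);
    if (!sessionId) return jsonResponse(401, { error: "unauthenticated" });

    const parsed: unknown = await request.json().catch(() => null);
    if (typeof parsed !== "object" || parsed === null) {
      return jsonResponse(400, { error: "invalid JSON body" });
    }
    const { text, messageId } = parsed as { text?: unknown; messageId?: unknown };
    if (typeof text !== "string" || typeof messageId !== "string") {
      return jsonResponse(400, { error: "text and messageId are required" });
    }

    // The browser only ever receives bubbles, never the key.
    const bubbles = await deps.getReply(sessionId, text, messageId);
    return jsonResponse(200, { bubbles });
  };
}
```

In a Next.js app this could be wired up in app/api/chat/route.ts as export const POST = makeChatRoute({ … }), with the adapter reading process.env.LOREOS_KEY on the server.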
Migration checklist
- Stop reading the reply from the POST /messages response; read data.cursor + data.run_ref, then read the reply from the event log.
- Render payload.bubbles (multi-bubble), falling back to payload.text.
- Create a session per end-user with your external_user_ref and reuse it.
- Default to reply_mode: "fast"; reach for deep only where voice polish beats latency.
- Send an Idempotency-Key (your message id) on every send.
- Show typing off run.status; clear on message.created.
- Dedupe reads/webhooks by the monotonic cursor.
- Keep LOREOS_KEY server-side; proxy the browser through your backend.
- (During cutover) wrap the adapter in a timeout + fallback to your old provider.
Once you are migrated, see Staging keys and repeatable evaluation to set up a persistent dev key and reproducible eval runs before you cut production over.