Staging keys and repeatable evaluation

The demo sandbox key is great for a first copy-paste, but it is the wrong tool for an eval loop: it is rate-limited at issuance, capped at 150 messages, blocks images and managed delivery, and expires. To evaluate a character repeatedly (across prompt tweaks, model changes, or regression runs), you want a persistent staging key with real limits, plus a pattern that gives every run an identical clean starting state.

This page covers both: getting a durable staging key (instead of fighting the demo rate limit), and the fresh-session pattern that makes repeated evals deterministic.

Staging key vs demo sandbox key

There are two ways to get a key, and they are for different jobs.

	`POST /v1/demo/sandbox-key`	Staging key
Auth to get	none (public)	self-serve account access
Lifetime	short-lived, auto-expires	persistent
Issuance limit	rate-limited (`429` when you hit the window)	none
Caps	five characters, fifteen sessions, 150 messages, 2,000 input chars	your provisioned app cap
Product surfaces	images / Telegram / webhooks / budget changes disabled	enabled
Good for	one-off copy-paste, the quickstart, a coding agent’s first call	repeatable evals, CI, staging

The demo key’s issuance rate limit is exactly what blocks an eval loop: re-minting a fresh demo key for each run trips 429. A staging key is issued once and reused, so the rate limit never enters your loop. Use Self-serve access to create or select a workspace, approve your coding agent, create a staging app, and issue a runtime key for that app.

Get a persistent staging key

For a dedicated staging key, use the self-serve account flow: a human approves the coding agent, the agent creates a staging app through /v1/account/*, and the agent issues a runtime API key for that app. Account-plane routes are guide-documented, not part of the generated runtime API Reference/SDK yet; this keeps normal runtime integrations focused on app keys and /v1/* Character OS calls. Once you have the runtime key, confirm it resolves with GET /v1/me, which returns the app, environment, and tenant it maps to.

$ curl https://api.loreos.app/v1/me -H "Authorization: Bearer $LOREOS_STAGING_KEY"

Keep LOREOS_STAGING_KEY server-side like any other LoreOS key — it can create state and trigger metered model work.
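
Before an eval suite starts, it can help to check that the key really maps to the app and environment you intended. The sketch below is a hypothetical helper, not part of any LoreOS SDK. It assumes the GET /v1/me payload exposes fields named app, environment, and tenant, and that the staging environment is reported as "staging"; the guide only says the call returns the app, environment, and tenant, so adjust the field names and values to the real response.

```typescript
// Assumed shape of the GET /v1/me payload. The field names are hypothetical:
// the guide only states that the app, environment, and tenant are returned.
interface KeyIdentity {
  app?: string;
  environment?: string;
  tenant?: string;
}

// Returns a list of human-readable problems. An empty list means the key
// resolves to the app and environment the eval suite expects.
function checkKeyIdentity(
  me: KeyIdentity,
  expected: { app: string; environment: string },
): string[] {
  const problems: string[] = [];
  if (!me.tenant) {
    problems.push("response has no tenant");
  }
  if (me.app !== expected.app) {
    problems.push(`key maps to app "${me.app ?? "(missing)"}", expected "${expected.app}"`);
  }
  if (me.environment !== expected.environment) {
    problems.push(
      `key maps to environment "${me.environment ?? "(missing)"}", expected "${expected.environment}"`,
    );
  }
  return problems;
}
```

Failing the suite up front when the list is non-empty avoids running evals against the wrong app and spending its budget by accident.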

Raise the limits past the demo caps

A demo key is capped at 150 messages — enough for several real multi-turn conversations, but still tight for a real eval suite (and it can’t generate images). A staging key uses your app cap instead. Set the cap high enough for one full suite run in the console/account plane when enabled, or through the preview provisioning channel. Then monitor settled usage with GET /v1/usage and rate/cap behavior with GET /v1/rates.
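
To judge whether a suite outgrows the demo caps, a rough count of messages and sessions per suite run is usually enough. The sketch below assumes one scripted step costs one message and each eval run creates one session; SuitePlan and the function names are illustrative. It counts messages and sessions only: the guide does not say how messages convert to the credits an app cap is denominated in, so it does not attempt that conversion, and it ignores the demo limits on characters and input length.

```typescript
// Hypothetical description of an eval suite.
interface SuitePlan {
  scripts: number;        // distinct scripted conversations
  stepsPerScript: number; // messages sent per conversation
  repeats: number;        // how many times each script is run
}

// Demo sandbox key caps as stated in the comparison table on this page.
const DEMO_MESSAGE_CAP = 150;
const DEMO_SESSION_CAP = 15;

// One scripted step = one POST /messages.
function messagesPerSuiteRun(plan: SuitePlan): number {
  return plan.scripts * plan.stepsPerScript * plan.repeats;
}

// One eval run = one fresh session.
function sessionsPerSuiteRun(plan: SuitePlan): number {
  return plan.scripts * plan.repeats;
}

function fitsDemoCaps(plan: SuitePlan): boolean {
  return (
    messagesPerSuiteRun(plan) <= DEMO_MESSAGE_CAP &&
    sessionsPerSuiteRun(plan) <= DEMO_SESSION_CAP
  );
}
```

For example, one script of 2 steps repeated 5 times needs 10 messages and 5 sessions, which fits the demo caps. A suite of 20 scripts with 6 steps each, repeated 3 times, needs 360 messages and 60 sessions, which exceeds both.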

Per-end-user caps are public and can be set with POST /v1/external-users/{external_user_ref}/budget-policy, but most eval loops use a fresh end-user per run and rely on the app cap.

When a run does hit the cap, the send returns 402 budget_exceeded before spending on the model, so a runaway eval fails safely and cheaply rather than burning credits.
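
Once the cap is hit, every later send in the suite can be expected to fail the same way, so it is usually more useful to stop the suite than to record a FAIL for each remaining run. The sketch below shows one way to make that decision. It assumes errors carry a message of the form "<status> <code>: <text>", which is the format thrown by the loreos() helper in the worked eval loop on this page; parseLoreosError and actionFor are hypothetical names.

```typescript
type FailureAction = "abort_suite" | "fail_run";

// Parses "<status> <code>: <text>", e.g. "402 budget_exceeded: raise the cap".
// Returns null for errors that do not follow that format (timeouts, etc.).
function parseLoreosError(message: string): { status: number; code: string } | null {
  const match = /^(\d{3}) ([A-Za-z0-9_.-]+):/.exec(message);
  if (!match) return null;
  return { status: Number(match[1]), code: match[2] ?? "" };
}

// A budget failure affects the whole suite; anything else only fails the
// current run and the suite moves on to the next one.
function actionFor(error: Error): FailureAction {
  const parsed = parseLoreosError(error.message);
  if (parsed && parsed.status === 402 && parsed.code === "budget_exceeded") {
    return "abort_suite";
  }
  return "fail_run";
}
```

In a suite runner, a catch block around each run can call actionFor(error) and break out of the loop on "abort_suite".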

Deterministic repeated evals: a fresh session per run

The key to reproducible evals is controlling starting state. A LoreOS character accumulates relational and world-model state per end-user as a conversation progresses — so two runs over the same session are not comparable: the second run starts where the first left off.

The clean pattern is the opposite of resetting: for each eval run, create a fresh session with a fresh external_user_ref. A brand-new external_user_ref has no prior relational or world-model state, so every run begins from an identical clean slate with no shared state between runs.

$ # A unique external_user_ref per run = a guaranteed clean slate.
$ RUN_USER="eval-$(date +%s)-$RANDOM"
$ 
$ curl -X POST https://api.loreos.app/v1/sessions \
>   -H "Authorization: Bearer $LOREOS_STAGING_KEY" \
>   -H "Content-Type: application/json" \
>   -d "{ \"character\": \"luna\", \"external_user_ref\": \"$RUN_USER\" }"

Use a new external_user_ref per run (a timestamp or UUID), not just a new session for the same user. A fresh session already gives the character a clean relational starting point, but world-model and memory state accrue against the end-user — so a brand-new external_user_ref is what guarantees nothing (relational, world-model, or memory) carries over from a previous run.
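
The same idea in TypeScript, as a minimal sketch: newRunUserRef is a hypothetical helper that combines a timestamp, a per-process counter, and a random suffix. It assumes external_user_ref accepts lowercase alphanumerics and hyphens, as the refs in the shell example do. The counter makes refs unique within one process; the random suffix makes collisions between parallel workers unlikely, not impossible.

```typescript
// Per-process counter so two runs started in the same millisecond still differ.
let runCounter = 0;

// Builds a throwaway external_user_ref for one eval run.
function newRunUserRef(prefix: string = "eval"): string {
  runCounter += 1;
  const timestamp = Date.now().toString(36);
  // Random base-36 suffix to separate refs minted by parallel workers.
  const random = Math.random().toString(36).slice(2, 10);
  return `${prefix}-${timestamp}-${runCounter}-${random}`;
}
```

Logging the ref alongside each run's results makes it possible to look up the exact session behind a failed assertion later.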

You do not need a “reset” endpoint

There is no session-reset endpoint, and you don’t need one — fresh sessions are both cleaner and already supported. A reset would have to scrub relational state, world-model state, the event log, and usage in place and prove it left nothing behind; a fresh external_user_ref gets you a guaranteed-clean slate with zero of that risk. If you went looking for a /reset route, this is why you won’t find one. Create a new session, run, then discard it.

A worked eval loop

Put it together: get and reuse the staging key once (above), then per eval — create a fresh session, send your scripted messages, poll for the replies, assert, and discard the session.

// eval-loop.ts — runs server-side with LOREOS_STAGING_KEY set.
const BASE = process.env.LOREOS_BASE || "https://api.loreos.app";
const KEY = process.env.LOREOS_STAGING_KEY!;
const CHARACTER = process.env.LOREOS_CHARACTER_SLUG || "luna";

async function loreos(path: string, options: RequestInit = {}) {
  const response = await fetch(`${BASE}${path}`, {
    ...options,
    headers: {
      Authorization: `Bearer ${KEY}`,
      "Content-Type": "application/json",
      ...(options.headers || {}),
    },
  });
  const body = await response.json().catch(() => ({}));
  if (!response.ok) {
    const detail = (body as any).detail || {};
    throw new Error(`${response.status} ${detail.code || "error"}: ${detail.fix || detail.message}`);
  }
  return (body as any).data;
}

// Send one scripted message and wait (sync-feeling) for the character's reply bubbles.
async function sendAndAwaitReply(sessionId: string, text: string): Promise<string[]> {
  const accepted = await loreos(`/v1/sessions/${sessionId}/messages`, {
    method: "POST",
    body: JSON.stringify({ text, reply_mode: "fast" }),
  });
  let cursor: number = accepted.cursor;
  const deadline = Date.now() + 45_000;
  while (Date.now() < deadline) {
    const page = await loreos(`/v1/sessions/${sessionId}/events?since=${cursor}`);
    for (const ev of page.events || []) {
      cursor = ev.cursor;
      const type = ev.type || ev.event_type;
      if (type === "message.created" && ev.role === "character") {
        return ev.payload?.bubbles?.length ? ev.payload.bubbles : [ev.payload?.text ?? ""];
      }
      if (type === "message.failed") throw new Error(`reply failed: ${ev.payload?.reason}`);
    }
    await new Promise((r) => setTimeout(r, 1_000));
  }
  throw new Error("timed out waiting for reply");
}

// One eval run over a clean slate.
async function runEval(script: { say: string; expect: (bubbles: string[]) => boolean }[]) {
  // FRESH user ref -> no prior relational/world-model state -> identical clean start.
  const externalUserRef = `eval-${Date.now()}-${Math.random().toString(36).slice(2)}`;
  const session = await loreos("/v1/sessions", {
    method: "POST",
    body: JSON.stringify({ character: CHARACTER, external_user_ref: externalUserRef }),
  });

  const results: { say: string; reply: string[]; pass: boolean }[] = [];
  for (const step of script) {
    const reply = await sendAndAwaitReply(session.session_id, step.say);
    results.push({ say: step.say, reply, pass: step.expect(reply) });
  }
  // Discard: just stop using the session. The next run gets its own fresh user ref.
  return { externalUserRef, results, passed: results.every((r) => r.pass) };
}

// Example: run the same script repeatedly; each run starts from the same clean slate.
const script = [
  { say: "hi! what's your name?", expect: (b: string[]) => b.join(" ").toLowerCase().includes("luna") },
  { say: "what are you up to today?", expect: (b: string[]) => b.length > 0 },
];

for (let i = 0; i < 5; i++) {
  const run = await runEval(script);
  console.log(`run ${i}: ${run.passed ? "PASS" : "FAIL"}`, run.results.map((r) => r.pass));
}

What makes this reproducible:

The key is minted once, so you never hit the demo issuance rate limit mid-suite.
Each run gets a fresh external_user_ref, so there is no shared relational or world-model state — every run starts identically.
Limits come from your app’s budget policy, so volume isn’t capped at the demo’s 150 messages; raise limit_credits to cover a full run.
Replies are asynchronous, so each assertion waits on the event log for that turn’s message.created — assert on payload.bubbles, not on the POST /messages response.
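
The event-log handling described in the last point can be pulled out into a pure function, which makes it testable against fixture events without a network call. The sketch below mirrors the logic of the worked loop above; the event shape (cursor, type or event_type, role, payload.bubbles, payload.text, payload.reason) is taken from that loop, while SessionEvent, TurnOutcome, and scanEvents are hypothetical names.

```typescript
// Event fields as used by the worked eval loop on this page.
interface SessionEvent {
  cursor: number;
  type?: string;
  event_type?: string;
  role?: string;
  payload?: { bubbles?: string[]; text?: string; reason?: string };
}

type TurnOutcome =
  | { status: "replied"; bubbles: string[]; cursor: number }
  | { status: "failed"; reason: string; cursor: number }
  | { status: "pending"; cursor: number };

// Scans one page of events. Returns the character's reply, a failure, or
// "pending" with the cursor to poll from next.
function scanEvents(events: SessionEvent[], since: number): TurnOutcome {
  let cursor = since;
  for (const ev of events) {
    cursor = ev.cursor;
    const type = ev.type ?? ev.event_type;
    if (type === "message.created" && ev.role === "character") {
      const bubbles = ev.payload?.bubbles;
      // Fall back to the plain text when no bubbles are present.
      const reply = bubbles && bubbles.length > 0 ? bubbles : [ev.payload?.text ?? ""];
      return { status: "replied", bubbles: reply, cursor };
    }
    if (type === "message.failed") {
      return { status: "failed", reason: ev.payload?.reason ?? "unknown", cursor };
    }
  }
  return { status: "pending", cursor };
}
```

A polling loop can call scanEvents on each page and keep polling from the returned cursor while the status is "pending".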

Notes for stable evals

Author the character to status="ready" first. A minimal smoke-test character (display name + one bio line) produces flat, less differentiated replies, which makes assertions noisy. Use POST /v1/characters/validate (dry-run: same validation + readiness scoring as create, without persisting) to iterate to authoring_readiness.status = "ready" before you evaluate. See the Character authoring guide and the Characters API reference for the full authoring contract.
Reply variance is expected. LoreOS replies are model-generated and intentionally lifelike, so two runs of the same script won’t be byte-identical even from an identical clean slate. Assert on properties (a fact recalled, a tone, the absence of forbidden content, relationship pacing) rather than exact strings.
fast mode keeps accuracy. Evaluating with reply_mode: "fast" (the default) only skips the advisory voice-quality critic; grounding, safety, and accuracy are unchanged, so it is a faithful target for accuracy/behavior assertions. Use deep only if your eval is specifically grading voice polish.
Per-end-user caps aren’t needed for eval. Because each run uses a throwaway external_user_ref, the app-level budget policy is sufficient. (If you do want to cap a single end-user in production, that is POST /v1/external-users/{external_user_ref}/budget-policy, which is separate from the app cap.)
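
Because replies vary between runs, property-style checks tend to be more stable than exact string matches. The sketch below shows a few small, composable predicates that fit the expect field of the script in the worked loop, plus a pass-rate helper for judging a script across several runs. All names here (Check, mentions, avoids, allOf, passRate) are hypothetical, and requiring a pass rate rather than a perfect score is a suggestion, not something the platform prescribes.

```typescript
// Same signature as the expect field in the worked eval loop.
type Check = (bubbles: string[]) => boolean;

// Joins all bubbles into one lowercase string for substring checks.
const joined = (bubbles: string[]): string => bubbles.join(" ").toLowerCase();

// Passes when the reply mentions the term anywhere, ignoring case.
const mentions = (term: string): Check => (bubbles) =>
  joined(bubbles).includes(term.toLowerCase());

// Passes when none of the forbidden terms appear, ignoring case.
const avoids = (...forbidden: string[]): Check => (bubbles) =>
  forbidden.every((term) => !joined(bubbles).includes(term.toLowerCase()));

// Passes only when every given check passes.
const allOf = (...checks: Check[]): Check => (bubbles) =>
  checks.every((check) => check(bubbles));

// Fraction of runs that passed; 0 for an empty list.
function passRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(Boolean).length / outcomes.length;
}
```

For example, a step could use expect: allOf(mentions("luna"), avoids("as an ai")), and a suite could treat a script as healthy when passRate over its runs stays at or above a threshold you choose, such as 0.8.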