Staging keys and repeatable evaluation

Mint a persistent dev key, then run reproducible character evals from a clean slate.

The demo sandbox key is great for a first copy-paste, but it is the wrong tool for an eval loop: it is rate-limited at issuance, capped at sixty messages, blocks images and managed delivery, and expires. To evaluate a character repeatedly (across prompt tweaks, model changes, or regression runs) you want a persistent staging key with real limits, plus a pattern that gives every run an identical clean starting state.

This page covers both: minting a durable dev key (instead of fighting the demo rate limit), and the fresh-session pattern that gives every repeated eval the same deterministic starting state.

Staging key vs demo sandbox key

There are two ways to get a key, and they are for different jobs.

| | Demo sandbox key (POST /v1/demo/sandbox-key) | Staging key (POST /v1/apps, then POST /v1/api-keys) |
| --- | --- | --- |
| Auth to mint | none (public) | an existing key for your tenant (your invite key) |
| Lifetime | short-lived, auto-expires | persistent |
| Issuance limit | rate-limited (429 when you hit the window) | none |
| Caps | three characters, five sessions, sixty messages, 1,200 input chars | your app’s budget policy |
| Product surfaces | images / Telegram / webhooks / budget changes disabled | enabled |
| Good for | one-off copy-paste, the quickstart, a coding agent’s first call | repeatable evals, CI, staging |

The demo key’s issuance rate limit is exactly what blocks an eval loop: re-minting a fresh demo key for each run trips 429. A staging key is minted once and reused, so the rate limit never enters your loop. (To mint a staging key you need a tenant — that comes from an invite key. Email daniel@contxts.io with a one-line note about what you’re building; see Authentication.)

Mint a persistent staging key

Create a dedicated staging app under your tenant, then issue a key for it. The raw key is returned once — store it server-side immediately; it cannot be recovered.

# 1. Create a staging app (using your invite key). environment: "sandbox" | "production".
export STAGING_APP_ID="$(curl -sS -X POST https://api.loreos.app/v1/apps \
  -H "Authorization: Bearer $LOREOS_INVITE_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "name": "eval-staging", "environment": "sandbox" }' \
  | jq -r '.data.app_id')"

# 2. Issue an API key for that app. The raw key is shown ONCE.
export LOREOS_STAGING_KEY="$(curl -sS -X POST https://api.loreos.app/v1/api-keys \
  -H "Authorization: Bearer $LOREOS_INVITE_KEY" \
  -H "Content-Type: application/json" \
  -d "{ \"app_id\": \"$STAGING_APP_ID\" }" \
  | jq -r '.data.api_key')"

POST /v1/apps and POST /v1/api-keys are not available to demo keys (app creation and key issuance are demo-blocked) — that is by design: you mint staging keys from your invite key, not from a demo key. Confirm the key resolves with GET /v1/me, which returns the app, environment, and tenant it maps to.

curl https://api.loreos.app/v1/me -H "Authorization: Bearer $LOREOS_STAGING_KEY"

Keep LOREOS_STAGING_KEY server-side like any other LoreOS key — it can create state and trigger metered model work.
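The GET /v1/me check can also be scripted so an eval suite refuses to start with the wrong key. The sketch below is one possible shape, not the official client: the field names in MeData and the helper names assertEnvironment and verifyStagingKey are assumptions (this page only says the response carries the app, environment, and tenant), so adjust them to the payload you actually receive.

```typescript
// Assumed shape of the `data` object returned by GET /v1/me.
// Only "app, environment, tenant" are documented; these field names are guesses.
type MeData = { app_id?: string; environment?: string; tenant_id?: string };

// Pure check: throws unless the key maps to the expected environment.
function assertEnvironment(me: MeData, expected = "sandbox"): MeData {
  if (me.environment !== expected) {
    throw new Error(
      `expected environment=${expected}, got ${me.environment ?? "unknown"}`,
    );
  }
  return me;
}

// Calls GET /v1/me with the given key and applies the check.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function verifyStagingKey(
  key: string,
  base = "https://api.loreos.app",
  fetchImpl: typeof fetch = fetch,
): Promise<MeData> {
  const response = await fetchImpl(`${base}/v1/me`, {
    headers: { Authorization: `Bearer ${key}` },
  });
  if (!response.ok) {
    throw new Error(`GET /v1/me failed with ${response.status}`);
  }
  const body = (await response.json()) as { data?: MeData };
  return assertEnvironment(body.data ?? {});
}
```

Calling verifyStagingKey(process.env.LOREOS_STAGING_KEY!) once at suite start-up would catch a key that points at the wrong app before any metered work happens.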

Raise the limits past the demo caps

A demo key is capped at sixty messages: enough for a quick multi-turn try, but still tight for a real eval suite (and it can’t generate images). A staging key uses your app’s budget policy instead. Set (or raise) the whole-app cap with POST /v1/budget-policies. resource_type: "all" is the app-wide hard cap, enforced in runtime preflight before any provider call:

curl -X POST https://api.loreos.app/v1/budget-policies \
  -H "Authorization: Bearer $LOREOS_STAGING_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "resource_type": "all", "limit_credits": 50, "enforcement": "hard" }'

Now your eval volume is bounded by your budget, not the demo’s sixty-message cap. Set limit_credits to comfortably cover one full suite run. (Per-end-user caps are set separately and are not needed for eval; see the note on fresh users below.)

When a run does hit the cap, the send returns 402 budget_exceeded before spending on the model, so a runaway eval fails safely and cheaply rather than burning credits.
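In an eval runner it helps to treat that 402 as a signal to stop the whole suite rather than as one more failed assertion. The sketch below assumes errors are thrown in the "<status> <code>: <message>" format used by the loreos helper in the worked eval loop on this page; isBudgetExceeded and runSuite are hypothetical names, not part of any LoreOS SDK.

```typescript
// True when an error looks like the 402 budget_exceeded failure described above.
// Relies on the "<status> <code>: ..." message format of this page's `loreos` helper.
function isBudgetExceeded(err: unknown): boolean {
  return err instanceof Error && err.message.startsWith("402 budget_exceeded");
}

type SuiteSummary = { completed: number; passed: number; budgetExhausted: boolean };

// Runs each eval in order and stops at the first budget failure,
// so remaining runs are skipped instead of failing one by one.
async function runSuite(runs: Array<() => Promise<boolean>>): Promise<SuiteSummary> {
  let completed = 0;
  let passed = 0;
  for (const run of runs) {
    try {
      if (await run()) passed += 1;
      completed += 1;
    } catch (err) {
      if (isBudgetExceeded(err)) {
        return { completed, passed, budgetExhausted: true };
      }
      throw err; // Any other failure is unexpected; surface it to the caller.
    }
  }
  return { completed, passed, budgetExhausted: false };
}
```

A summary with budgetExhausted set to true then means "raise limit_credits and rerun", which is a different action from investigating a character regression.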

Deterministic repeated evals: a fresh session per run

The key to reproducible evals is controlling starting state. A LoreOS character accumulates relational and world-model state per end-user as a conversation progresses — so two runs over the same session are not comparable: the second run starts where the first left off.

The clean pattern is the opposite of resetting: for each eval run, create a fresh session with a fresh external_user_ref. A brand-new external_user_ref has no prior relational or world-model state, so every run begins from an identical clean slate with no shared state between runs.

# A unique external_user_ref per run = a guaranteed clean slate.
RUN_USER="eval-$(date +%s)-$RANDOM"

curl -X POST https://api.loreos.app/v1/sessions \
  -H "Authorization: Bearer $LOREOS_STAGING_KEY" \
  -H "Content-Type: application/json" \
  -d "{ \"character\": \"luna\", \"external_user_ref\": \"$RUN_USER\" }"

Use a new external_user_ref per run (a timestamp or UUID), not just a new session for the same user. A fresh session already gives the character a clean relational starting point, but world-model and memory state accrue against the end-user — so a brand-new external_user_ref is what guarantees nothing (relational, world-model, or memory) carries over from a previous run.
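When runs execute in parallel (for example, several CI workers started in the same second), a timestamp alone can collide, so a UUID is the safer choice. The following is a minimal sketch using Node's built-in randomUUID; the freshUserRef name and the "eval-<suite>-" prefix are illustrative conventions rather than API requirements, and any length or character limits on external_user_ref are not covered here.

```typescript
import { randomUUID } from "node:crypto";

// Builds a unique external_user_ref for one eval run.
// The suite label only makes refs easier to recognise later; it is optional.
function freshUserRef(suite = "default"): string {
  return `eval-${suite}-${randomUUID()}`;
}

// Two calls never share a ref, so two runs never share end-user state.
const refA = freshUserRef("smoke");
const refB = freshUserRef("smoke");
console.log(refA !== refB);
```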

You do not need a “reset” endpoint

There is no session-reset endpoint, and you don’t need one — fresh sessions are both cleaner and already supported. A reset would have to scrub relational state, world-model state, the event log, and usage in place and prove it left nothing behind; a fresh external_user_ref gets you a guaranteed-clean slate with zero of that risk. If you went looking for a /reset route, this is why you won’t find one. Create a new session, run, then discard it.

A worked eval loop

Put it together: mint the staging key once (above), then per eval — create a fresh session, send your scripted messages, poll for the replies, assert, and discard the session.

// eval-loop.ts: runs server-side with LOREOS_STAGING_KEY set.
const BASE = process.env.LOREOS_BASE || "https://api.loreos.app";
const KEY = process.env.LOREOS_STAGING_KEY!;
const CHARACTER = process.env.LOREOS_CHARACTER_SLUG || "luna";

async function loreos(path: string, options: RequestInit = {}) {
  const response = await fetch(`${BASE}${path}`, {
    ...options,
    headers: {
      Authorization: `Bearer ${KEY}`,
      "Content-Type": "application/json",
      ...(options.headers || {}),
    },
  });
  const body = await response.json().catch(() => ({}));
  if (!response.ok) {
    const detail = (body as any).detail || {};
    throw new Error(`${response.status} ${detail.code || "error"}: ${detail.fix || detail.message}`);
  }
  return (body as any).data;
}

// Send one scripted message and wait (sync-feeling) for the character's reply bubbles.
async function sendAndAwaitReply(sessionId: string, text: string): Promise<string[]> {
  const accepted = await loreos(`/v1/sessions/${sessionId}/messages`, {
    method: "POST",
    body: JSON.stringify({ text, reply_mode: "fast" }),
  });
  let cursor: number = accepted.cursor;
  const deadline = Date.now() + 45_000;
  while (Date.now() < deadline) {
    const page = await loreos(`/v1/sessions/${sessionId}/events?since=${cursor}`);
    for (const ev of page.events || []) {
      cursor = ev.cursor;
      const type = ev.event_type || ev.type;
      if (type === "message.created" && ev.role === "character") {
        return ev.payload?.bubbles?.length ? ev.payload.bubbles : [ev.payload?.text ?? ""];
      }
      if (type === "message.failed") throw new Error(`reply failed: ${ev.payload?.reason}`);
    }
    await new Promise((r) => setTimeout(r, 1_000));
  }
  throw new Error("timed out waiting for reply");
}

// One eval run over a clean slate.
async function runEval(script: { say: string; expect: (bubbles: string[]) => boolean }[]) {
  // FRESH user ref -> no prior relational/world-model state -> identical clean start.
  const externalUserRef = `eval-${Date.now()}-${Math.random().toString(36).slice(2)}`;
  const session = await loreos("/v1/sessions", {
    method: "POST",
    body: JSON.stringify({ character: CHARACTER, external_user_ref: externalUserRef }),
  });

  const results: { say: string; reply: string[]; pass: boolean }[] = [];
  for (const step of script) {
    const reply = await sendAndAwaitReply(session.session_id, step.say);
    results.push({ say: step.say, reply, pass: step.expect(reply) });
  }
  // Discard: just stop using the session. The next run gets its own fresh user ref.
  return { externalUserRef, results, passed: results.every((r) => r.pass) };
}

// Example: run the same script repeatedly; each run starts from the same clean slate.
const script = [
  { say: "hi! what's your name?", expect: (b: string[]) => b.join(" ").toLowerCase().includes("luna") },
  { say: "what are you up to today?", expect: (b: string[]) => b.length > 0 },
];

for (let i = 0; i < 5; i++) {
  const run = await runEval(script);
  console.log(`run ${i}: ${run.passed ? "PASS" : "FAIL"}`, run.results.map((r) => r.pass));
}

What makes this reproducible:

  • The key is minted once, so you never hit the demo issuance rate limit mid-suite.
  • Each run gets a fresh external_user_ref, so there is no shared relational or world-model state — every run starts identically.
  • Limits come from your app’s budget policy, so volume isn’t capped at the demo’s sixty messages; raise limit_credits to cover a full run.
  • Replies are asynchronous, so each assertion waits on the event log for that turn’s message.created — assert on payload.bubbles, not on the POST /messages response.

Notes for stable evals

  • Author the character to status="ready" first. A minimal smoke-test character (display name + one bio line) produces flat, less differentiated replies, which makes assertions noisy. Use POST /v1/characters/validate (dry-run: same validation + readiness scoring as create, without persisting) to iterate to authoring_readiness.status = "ready" before you evaluate. See the Character authoring guide and the Characters API reference for the full authoring contract.
  • Reply variance is expected. LoreOS replies are model-generated and intentionally lifelike, so two runs of the same script won’t be byte-identical even from an identical clean slate. Assert on properties (a fact recalled, a tone, the absence of forbidden content, relationship pacing) rather than exact strings.
  • fast mode keeps accuracy. Evaluating with reply_mode: "fast" (the default) only skips the advisory voice-quality critic; grounding, safety, and accuracy are unchanged, so it is a faithful target for accuracy/behavior assertions. Use deep only if your eval is specifically grading voice polish.
  • Per-end-user caps aren’t needed for eval. Because each run uses a throwaway external_user_ref, the app-level budget policy is sufficient. (If you do want to cap a single end-user in production, that is POST /v1/external-users/{ref}/budget-policy, not the app-level route above.)
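
Because replies vary between runs, assertions tend to be more stable when written as small property checks and judged over several runs. The helpers below are a hypothetical sketch of that idea (mentionsAny, avoidsAll, allOf, and passRate are not LoreOS APIs); each check has the same (bubbles: string[]) => boolean shape as the expect callbacks in the worked eval loop.

```typescript
// A property check over the reply bubbles of one turn.
type Check = (bubbles: string[]) => boolean;

// Joins bubbles into one lowercase string for case-insensitive matching.
const flatten = (bubbles: string[]): string => bubbles.join(" ").toLowerCase();

// Passes when the reply mentions at least one of the given terms.
const mentionsAny = (...terms: string[]): Check => (bubbles) => {
  const text = flatten(bubbles);
  return terms.some((term) => text.includes(term.toLowerCase()));
};

// Passes when the reply contains none of the forbidden terms.
const avoidsAll = (...terms: string[]): Check => (bubbles) => {
  const text = flatten(bubbles);
  return terms.every((term) => !text.includes(term.toLowerCase()));
};

// Combines checks; every one of them must pass.
const allOf = (...checks: Check[]): Check => (bubbles) =>
  checks.every((check) => check(bubbles));

// Fraction of runs that passed; returns 0 for an empty list.
function passRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(Boolean).length / outcomes.length;
}

// Example: the name must appear and a forbidden phrase must not.
const introducesSelf = allOf(mentionsAny("luna"), avoidsAll("as an ai"));
console.log(introducesSelf(["Hi!", "I'm Luna."]));
```

A suite could then require, say, passRate(outcomes) >= 0.8 over five runs instead of demanding that every run pass; the threshold is a judgment call for your own eval, not a documented recommendation.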