Staging keys and repeatable evaluation
The demo sandbox key is great for a first copy-paste, but it is the wrong tool for an eval loop: it is rate-limited at issuance, capped at sixty messages, blocked from images and managed delivery, and it expires. To evaluate a character repeatedly (across prompt tweaks, model changes, or regression runs) you want a persistent staging key with real limits, plus a pattern that gives every run an identical clean starting state.
This page covers both: minting a durable dev key (instead of fighting the demo rate limit), and the fresh-session pattern that makes repeated evals deterministic.
Staging key vs demo sandbox key
There are two ways to get a key, and they are for different jobs.
The demo key’s issuance rate limit is exactly what blocks an eval loop: re-minting a fresh
demo key for each run trips a 429. A staging key is minted once and reused, so the rate
limit never enters your loop. (To mint a staging key you need a tenant, which comes from an
invite key. Email daniel@contxts.io with a one-line note about what you’re building; see
Authentication.)
Mint a persistent staging key
Create a dedicated staging app under your tenant, then issue a key for it. The raw key is returned once — store it server-side immediately; it cannot be recovered.
POST /v1/apps and POST /v1/api-keys are not available to demo keys (app creation and
key issuance are demo-blocked) — that is by design: you mint staging keys from your invite
key, not from a demo key. Confirm the key resolves with GET /v1/me, which returns the app,
environment, and tenant it maps to.
Keep LOREOS_STAGING_KEY server-side like any other LoreOS key — it can create state and
trigger metered model work.
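As a rough sketch of the mint-and-confirm flow described above, the Python below builds the three requests without sending them. Only the routes (POST /v1/apps, POST /v1/api-keys, GET /v1/me) and the LOREOS_STAGING_KEY variable name come from this page. The base URL, the Bearer authorization header, and the body fields (name, app_id, environment) are assumptions for illustration; check the Authentication page and the API reference for the real contract.

```python
import json
import os
import urllib.request

# Assumption: placeholder host; substitute the real LoreOS API base URL.
BASE_URL = "https://api.example.invalid"


def build_request(method, path, key, body=None):
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    # Assumption: bearer-token auth.
    req.add_header("Authorization", f"Bearer {key}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def create_app_request(invite_key, name="eval-staging"):
    # Hypothetical body field: "name".
    return build_request("POST", "/v1/apps", invite_key, {"name": name})


def issue_key_request(invite_key, app_id):
    # Hypothetical body fields: "app_id" and "environment".
    body = {"app_id": app_id, "environment": "staging"}
    return build_request("POST", "/v1/api-keys", invite_key, body)


def whoami_request(staging_key):
    # GET /v1/me confirms which app, environment, and tenant the key maps to.
    return build_request("GET", "/v1/me", staging_key)


def load_staging_key(env=os.environ):
    """Read the key from the server-side environment; never hard-code it."""
    key = env.get("LOREOS_STAGING_KEY")
    if not key:
        raise RuntimeError("LOREOS_STAGING_KEY is not set")
    return key
```

Each builder returns a urllib.request.Request that can be passed to urllib.request.urlopen once the placeholder host and field names are replaced with the real ones.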
Raise the limits past the demo caps
A demo key is capped at sixty messages — enough for a quick multi-turn try, but still tight
for a real eval suite (and it can’t generate images). A staging key uses your app’s budget
policy instead. Set (or raise) the whole-app cap with
POST /v1/budget-policies — resource_type: "all" is the app-wide hard cap, enforced in
runtime preflight before any provider call:
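A minimal sketch of the request body, assuming the route accepts just these two fields. resource_type and limit_credits are the names used on this page; the helper name and the 5000-credit figure are placeholders, and the real route may require more fields.

```python
import json


def app_budget_policy(limit_credits):
    """Body for POST /v1/budget-policies that sets the app-wide hard cap."""
    if limit_credits <= 0:
        raise ValueError("limit_credits must be positive")
    # resource_type "all" is the app-wide cap described in the text.
    return {"resource_type": "all", "limit_credits": limit_credits}


# Placeholder figure: size this to cover one full suite run.
body = json.dumps(app_budget_policy(5000))
```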
Now your eval volume is bounded by your budget, not the demo’s sixty-message cap. Set
limit_credits to comfortably cover one full suite run. (Per-end-user caps are set
separately and are not needed for eval — see the note on fresh users below.)
When a run does hit the cap, the send returns 402 budget_exceeded before spending on
the model, so a runaway eval fails safely and cheaply rather than burning credits.
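One way an eval harness might react to that response is to stop the suite at the first 402 instead of retrying. The sketch below assumes the harness already has the HTTP status and parsed body in hand; the "error" key in the body and the send callable are hypothetical, since this page only states the status and the budget_exceeded code.

```python
class BudgetExceeded(Exception):
    """Raised when a send is rejected with 402 budget_exceeded."""


def check_send_response(status, body):
    """Return the body on success; raise on a budget or other HTTP error."""
    if status == 402:
        # Assumption: the error code is exposed under an "error" key.
        raise BudgetExceeded(body.get("error", "budget_exceeded"))
    if status >= 400:
        raise RuntimeError(f"send failed with HTTP {status}")
    return body


def run_suite(runs, send):
    """Call send(run) for each run, stopping at the first budget rejection."""
    completed = []
    for run in runs:
        try:
            completed.append(send(run))
        except BudgetExceeded:
            # The cap is hit: later runs would be rejected too, so stop here.
            break
    return completed
```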
Deterministic repeated evals: a fresh session per run
The key to reproducible evals is controlling starting state. A LoreOS character accumulates relational and world-model state per end-user as a conversation progresses — so two runs over the same session are not comparable: the second run starts where the first left off.
The clean pattern is the opposite of resetting: for each eval run, create a fresh
session with a fresh external_user_ref. A brand-new external_user_ref has no prior
relational or world-model state, so every run begins from an identical clean slate with
no shared state between runs.
Use a new external_user_ref per run (a timestamp or UUID), not just a new session for the
same user. A fresh session already gives the character a clean relational starting point,
but world-model and memory state accrue against the end-user — so a brand-new
external_user_ref is what guarantees nothing (relational, world-model, or memory)
carries over from a previous run.
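Generating the per-run reference can be as small as the sketch below, which combines a UTC timestamp with a random UUID. external_user_ref is the field name from this page; the "eval" prefix, the helper names, and the character_id field are illustrative assumptions.

```python
import uuid
from datetime import datetime, timezone


def fresh_user_ref(suite="eval"):
    """Return a unique external_user_ref for one eval run."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # The UUID guarantees uniqueness; the timestamp only aids readability.
    return f"{suite}-{stamp}-{uuid.uuid4().hex}"


def new_session_body(character_id):
    # Hypothetical field name: "character_id".
    return {
        "character_id": character_id,
        "external_user_ref": fresh_user_ref(),
    }
```

Including the timestamp makes it easier to trace a run in logs later, while the UUID keeps parallel runs started in the same second from colliding.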
You do not need a “reset” endpoint
There is no session-reset endpoint, and you don’t need one — fresh sessions are both
cleaner and already supported. A reset would have to scrub relational state, world-model
state, the event log, and usage in place and prove it left nothing behind; a fresh
external_user_ref gets you a guaranteed-clean slate with zero of that risk. If you went
looking for a /reset route, this is why you won’t find one. Create a new session, run,
then discard it.
A worked eval loop
Put it together: mint the staging key once (above), then per eval — create a fresh session, send your scripted messages, poll for the replies, assert, and discard the session.
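The loop can be sketched as below, written against an injected client object so the structure is visible without committing to exact routes. The client methods (create_session, send_message, list_events), the after cursor, and the event "id" and "type" keys are all hypothetical; only message.created, payload.bubbles, and external_user_ref come from this page.

```python
import time
import uuid


def wait_for_reply(client, session_id, after=None, timeout_s=30.0, interval_s=0.5):
    """Poll the event log until a message.created event shows up."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Hypothetical: list_events returns event dicts newer than `after`.
        for event in client.list_events(session_id, after=after):
            if event["type"] == "message.created":
                return event
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no reply within {timeout_s}s")
        time.sleep(interval_s)


def run_eval(client, character_id, script, check):
    """One eval run: fresh user, scripted turns, one check per reply."""
    user_ref = f"eval-{uuid.uuid4().hex}"
    session_id = client.create_session(character_id, external_user_ref=user_ref)
    results = []
    cursor = None
    for text in script:
        # The send only enqueues the turn; the reply arrives on the event log.
        client.send_message(session_id, text)
        event = wait_for_reply(client, session_id, after=cursor)
        cursor = event["id"]
        results.append(check(text, event["payload"]["bubbles"]))
    # Discarding is implicit: the session and user_ref are never reused.
    return user_ref, results
```

Passing the client in keeps the loop testable with an in-memory stub, and the check callable receives the bubbles so each turn can apply its own assertion.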
What makes this reproducible:
- The key is minted once, so you never hit the demo issuance rate limit mid-suite.
- Each run gets a fresh external_user_ref, so there is no shared relational or world-model state; every run starts identically.
- Limits come from your app’s budget policy, so volume isn’t capped at the demo’s sixty messages; raise limit_credits to cover a full run.
- Replies are asynchronous, so each assertion waits on the event log for that turn’s message.created. Assert on payload.bubbles, not on the POST /messages response.
Notes for stable evals
- Author the character to status="ready" first. A minimal smoke-test character (display name + one bio line) produces flat, less differentiated replies, which makes assertions noisy. Use POST /v1/characters/validate (dry-run: same validation and readiness scoring as create, without persisting) to iterate to authoring_readiness.status = "ready" before you evaluate. See the Character authoring guide and the Characters API reference for the full authoring contract.
- Reply variance is expected. LoreOS replies are model-generated and intentionally lifelike, so two runs of the same script won’t be byte-identical even from an identical clean slate. Assert on properties (a fact recalled, a tone, the absence of forbidden content, relationship pacing) rather than exact strings.
- fast mode keeps accuracy. Evaluating with reply_mode: "fast" (the default) only skips the advisory voice-quality critic; grounding, safety, and accuracy are unchanged, so it is a faithful target for accuracy/behavior assertions. Use deep only if your eval is specifically grading voice polish.
- Per-end-user caps aren’t needed for eval. Because each run uses a throwaway external_user_ref, the app-level budget policy is sufficient. (If you do want to cap a single end-user in production, that is POST /v1/external-users/{ref}/budget-policy, not the app-level route above.)
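To illustrate asserting on properties rather than exact strings, here is a small sketch of two checks over reply text. It assumes you have already joined the bubbles of a reply into one string; the helper names and the sample reply are made up for illustration, and substring matching is only the simplest possible property check.

```python
def mentions(reply, *terms):
    """True if every term appears in the reply, ignoring case."""
    low = reply.casefold()
    return all(term.casefold() in low for term in terms)


def avoids(reply, *forbidden):
    """True if none of the forbidden terms appear, ignoring case."""
    low = reply.casefold()
    return not any(term.casefold() in low for term in forbidden)


# Made-up reply: the wording may vary between runs, the property should not.
reply = "Of course I remember, your dog is called Biscuit!"
recalled = mentions(reply, "biscuit")
clean = avoids(reply, "as an ai", "language model")
```

Checks like these tolerate the run-to-run wording variance described above, while still failing when the recalled fact is missing or forbidden content appears.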