Troubleshooting
Debug a sudden cost spike
Yesterday's spend tripled. Find out why and stop the bleeding.
The goal
Identify the agent, user, or version responsible. Take action: roll back, cap, or pause.
Steps
Open the cost dashboard.
Sidebar -> Monitoring -> index page. The chart shows daily spend by agent. The spike's date stands out.
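If you would rather confirm the spike from exported data than eyeball the chart, here is a minimal sketch. It assumes a hypothetical export of one record per charge with date, agent, and usd fields; the field names and the 2x-median threshold are illustrative, not the platform's API.

```python
from collections import defaultdict
from statistics import median

# Hypothetical export: one record per charge (field names are assumptions).
records = [
    {"date": "2024-06-01", "agent": "support-bot", "usd": 41.0},
    {"date": "2024-06-02", "agent": "support-bot", "usd": 39.5},
    {"date": "2024-06-03", "agent": "support-bot", "usd": 122.8},
]

# Sum spend per day, then flag any day well above the median baseline.
daily = defaultdict(float)
for r in records:
    daily[r["date"]] += r["usd"]

baseline = median(daily.values())
for day, usd in sorted(daily.items()):
    if usd > 2 * baseline:
        print(f"{day}: ${usd:.2f} vs baseline ${baseline:.2f} <- spike")
```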
Drill by agent.
Click the spike day. The breakdown shows top agents. The leader is your suspect.
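The same attribution works offline. A sketch, reusing the hypothetical record shape above, that ranks agents by spend on the spike day:

```python
from collections import defaultdict

def top_agents(records, spike_day):
    """Rank agents by spend on the spike day, biggest first."""
    by_agent = defaultdict(float)
    for r in records:
        if r["date"] == spike_day:
            by_agent[r["agent"]] += r["usd"]
    return sorted(by_agent.items(), key=lambda kv: kv[1], reverse=True)

# The leader is your suspect:
# top_agents(records, "2024-06-03")[0] -> ("support-bot", 122.8)
```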
Drill by lane.
Click the suspect agent. Cost by lane shows whether the spike is in model_inference, embedding, extraction, or judge. The lane tells you what to suspect:
- model_inference up: a prompt change increased per-turn cost, or volume increased.
- embedding up: extraction is over-firing or memory imports happened.
- extraction up: the extractor's policy got chatty.
- judge up: an auto-eval started running on every save.
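The lane-to-suspect mapping is mechanical enough to encode. A sketch; the hint strings paraphrase the list above, and only the lane names come from the dashboard:

```python
LANE_HINTS = {
    "model_inference": "prompt change raised per-turn cost, or volume rose",
    "embedding": "extraction over-firing, or a memory import ran",
    "extraction": "extractor policy got chatty",
    "judge": "auto-eval running on every save",
}

def diagnose(lane_spend):
    """Name the lane contributing the most spend and what it implies."""
    lane = max(lane_spend, key=lane_spend.get)
    return f"{lane}: {LANE_HINTS.get(lane, 'unknown lane')}"

print(diagnose({"model_inference": 90.0, "embedding": 12.0, "judge": 3.5}))
# -> model_inference: prompt change raised per-turn cost, or volume rose
```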
Drill by user.
Same dashboard, switch to "Top users". A single user dominating the agent's spend is likely a bot or a leaking integration.
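"Dominating" is easy to make precise. A sketch that flags a user carrying more than half the agent's spend; the 50% threshold is an assumption, so tune it to your traffic:

```python
def dominant_user(user_spend, threshold=0.5):
    """Return the top user if they carry more than `threshold` of spend."""
    total = sum(user_spend.values())
    if total == 0:
        return None
    user, usd = max(user_spend.items(), key=lambda kv: kv[1])
    return user if usd / total > threshold else None

print(dominant_user({"u_alice": 12.0, "u_bot": 180.0, "u_bob": 9.0}))
# -> u_bot: likely a bot or a leaking integration
```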
Drill into a turn.
Pick the most expensive thread of the day. Click into Traces. Find the slowest model.call span; check whether the prompt is unexpectedly large.
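If you export the trace instead of clicking through it, the same check is a few lines. This sketch assumes hypothetical span dicts with name, duration_ms, and prompt_tokens fields; the 20k-token bar for "unexpectedly large" is arbitrary:

```python
def worst_model_call(spans, token_bar=20_000):
    """Find the slowest model.call span and flag an oversized prompt."""
    calls = [s for s in spans if s["name"] == "model.call"]
    if not calls:
        return None
    worst = max(calls, key=lambda s: s["duration_ms"])
    if worst["prompt_tokens"] > token_bar:
        print(f"prompt is {worst['prompt_tokens']} tokens -- "
              "look for tool-output bloat in the turn before it")
    return worst
```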
Roll back if it was a version regression.
Compare timestamps: if the spike correlates with an agent version save, roll back to the previous version.
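A sketch of the timestamp comparison, assuming you can list version-save times; the six-hour window is an assumption about how quickly a bad version shows up in spend:

```python
from datetime import datetime, timedelta

def suspect_version_save(spike_start, version_saves,
                         window=timedelta(hours=6)):
    """Return the last version save before the spike if it is recent
    enough to be the likely cause; otherwise None."""
    prior = [t for t in version_saves if t <= spike_start]
    if not prior:
        return None
    latest = max(prior)
    return latest if spike_start - latest <= window else None

print(suspect_version_save(
    datetime(2024, 6, 3, 9, 0),
    [datetime(2024, 6, 1, 14, 0), datetime(2024, 6, 3, 8, 30)],
))  # -> 2024-06-03 08:30, the save just before the spike
```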
Cap.
Even if you fix the root cause, set a budget cap so a future regression fails loud, not silent.
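If you enforce the cap in your own job or gateway rather than in the dashboard, make the check loud. A sketch; deriving the cap from baseline spend with fixed headroom is an assumption, not a platform feature:

```python
def cap_from_baseline(baseline_usd, headroom=1.5):
    """Set the cap just above a normal day so regressions trip it fast."""
    return baseline_usd * headroom

def check_cap(spend_usd, cap_usd):
    """Fail loud, not silent: raise the moment spend crosses the cap."""
    if spend_usd > cap_usd:
        raise RuntimeError(
            f"budget cap exceeded: ${spend_usd:.2f} > ${cap_usd:.2f}")

check_cap(48.0, cap_from_baseline(40.0))    # fine: under the $60 cap
# check_cap(122.8, cap_from_baseline(40.0))  # raises RuntimeError
```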
Verify
- Post-rollback, daily spend returns to baseline.
- The cap fires an alert when spend exceeds the expected normal-day level.
Common findings
- Auto-eval on save was enabled and a new version triggered a 50-input judge run.
- A user's session was held open and the chat client was retrying every 2 seconds (see the detection sketch after this list).
- A new tool returned huge results, inflating the next turn's prompt.
- Memory extraction policy moved from batched to post-turn.
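The retry-loop finding above has a clear signature: requests landing on a near-fixed interval. A sketch for spotting it in one user's request timestamps; the 2-second interval, run length, and jitter tolerance are all assumptions:

```python
from datetime import timedelta

def looks_like_retry_loop(timestamps, interval=timedelta(seconds=2),
                          min_run=10, jitter=0.5):
    """True if `min_run` consecutive requests arrive roughly `interval`
    apart -- the signature of a client retrying on a timer."""
    ts = sorted(timestamps)
    run = 1  # current streak of evenly spaced requests
    for a, b in zip(ts, ts[1:]):
        if abs((b - a - interval).total_seconds()) <= jitter:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 1
    return False
```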
Next steps
- Set a budget cap.
- Trace a single turn for forensic detail on a specific spike.
