Troubleshooting
Debug a sudden cost spike
Yesterday's spend tripled. Find out why and stop the bleeding.
The goal
Identify the agent, user, or version responsible. Take action: roll back, cap, or pause.
Steps
Open the cost dashboard.
Sidebar -> Monitoring -> index page. The chart shows daily spend by agent. The spike's date stands out.
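If you would rather confirm the spike from exported data than eyeball the chart, here is a minimal sketch. It assumes a hypothetical export of one record per charge with date, agent, and usd fields; the field names and the 2x-median threshold are illustrative, not the platform's API.

```python
from collections import defaultdict
from statistics import median

# Hypothetical export: one record per charge (field names are assumptions).
records = [
    {"date": "2024-06-01", "agent": "support-bot", "usd": 41.0},
    {"date": "2024-06-02", "agent": "support-bot", "usd": 39.5},
    {"date": "2024-06-03", "agent": "support-bot", "usd": 122.8},
]

# Sum spend per day, then flag any day well above the median baseline.
daily = defaultdict(float)
for r in records:
    daily[r["date"]] += r["usd"]

baseline = median(daily.values())
for day, usd in sorted(daily.items()):
    if usd > 2 * baseline:
        print(f"{day}: ${usd:.2f} vs baseline ${baseline:.2f} <- spike")
```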
Drill by agent.
Click the spike day. The breakdown shows top agents. The leader is your suspect.
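The same attribution works offline. A sketch, reusing the hypothetical record shape above, that ranks agents by spend on the spike day:

```python
from collections import defaultdict

def top_agents(records, spike_day):
    """Rank agents by spend on the spike day, biggest first."""
    by_agent = defaultdict(float)
    for r in records:
        if r["date"] == spike_day:
            by_agent[r["agent"]] += r["usd"]
    return sorted(by_agent.items(), key=lambda kv: kv[1], reverse=True)

# The leader is your suspect:
# top_agents(records, "2024-06-03")[0] -> ("support-bot", 122.8)
```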
Drill by lane.
Click the suspect agent. Cost by lane shows whether the spike is in model_inference, embedding, extraction, or judge. The lane tells you what to suspect:
- model_inference up: a prompt change increased per-turn cost, or volume increased.
- embedding up: extraction is over-firing or memory imports happened.
- extraction up: the extractor's policy got chatty.
- judge up: an auto-eval started running on every save.
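The lane-to-suspect mapping is mechanical enough to encode. A sketch; the hint strings paraphrase the list above, and only the lane names come from the dashboard:

```python
LANE_HINTS = {
    "model_inference": "prompt change raised per-turn cost, or volume rose",
    "embedding": "extraction over-firing, or a memory import ran",
    "extraction": "extractor policy got chatty",
    "judge": "auto-eval running on every save",
}

def diagnose(lane_spend):
    """Name the lane contributing the most spend and what it implies."""
    lane = max(lane_spend, key=lane_spend.get)
    return f"{lane}: {LANE_HINTS.get(lane, 'unknown lane')}"

print(diagnose({"model_inference": 90.0, "embedding": 12.0, "judge": 3.5}))
# -> model_inference: prompt change raised per-turn cost, or volume rose
```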
Drill by user.
Same dashboard, switch to "Top users". A single user dominating the agent's spend is likely a bot or a leaking integration.
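"Dominating" is easy to make precise. A sketch that flags a user carrying more than half the agent's spend; the 50% threshold is an assumption, so tune it to your traffic:

```python
def dominant_user(user_spend, threshold=0.5):
    """Return the top user if they carry more than `threshold` of spend."""
    total = sum(user_spend.values())
    if total == 0:
        return None
    user, usd = max(user_spend.items(), key=lambda kv: kv[1])
    return user if usd / total > threshold else None

print(dominant_user({"u_alice": 12.0, "u_bot": 180.0, "u_bob": 9.0}))
# -> u_bot: likely a bot or a leaking integration
```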
Drill into a turn.
Pick the most expensive thread of the day. Click into Traces. Find the slowest model.call span; check whether the prompt is unexpectedly large.
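If you export the trace instead of clicking through it, the same check is a few lines. This sketch assumes hypothetical span dicts with name, duration_ms, and prompt_tokens fields; the 20k-token bar for "unexpectedly large" is arbitrary:

```python
def worst_model_call(spans, token_bar=20_000):
    """Find the slowest model.call span and flag an oversized prompt."""
    calls = [s for s in spans if s["name"] == "model.call"]
    if not calls:
        return None
    worst = max(calls, key=lambda s: s["duration_ms"])
    if worst["prompt_tokens"] > token_bar:
        print(f"prompt is {worst['prompt_tokens']} tokens -- "
              "look for tool-output bloat in the turn before it")
    return worst
```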
Roll back if it was a version regression.
Compare timestamps: if the spike correlates with an agent version save, roll back to the previous version.
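A sketch of the timestamp comparison, assuming you can list version-save times; the six-hour window is an assumption about how quickly a bad version shows up in spend:

```python
from datetime import datetime, timedelta

def suspect_version_save(spike_start, version_saves,
                         window=timedelta(hours=6)):
    """Return the last version save before the spike if it is recent
    enough to be the likely cause; otherwise None."""
    prior = [t for t in version_saves if t <= spike_start]
    if not prior:
        return None
    latest = max(prior)
    return latest if spike_start - latest <= window else None

print(suspect_version_save(
    datetime(2024, 6, 3, 9, 0),
    [datetime(2024, 6, 1, 14, 0), datetime(2024, 6, 3, 8, 30)],
))  # -> 2024-06-03 08:30, the save just before the spike
```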
Cap.
Even if you fix the root cause, set a budget cap so a future regression fails loud, not silent.
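If you enforce the cap in your own job or gateway rather than in the dashboard, make the check loud. A sketch; deriving the cap from baseline spend with fixed headroom is an assumption, not a platform feature:

```python
def cap_from_baseline(baseline_usd, headroom=1.5):
    """Set the cap just above a normal day so regressions trip it fast."""
    return baseline_usd * headroom

def check_cap(spend_usd, cap_usd):
    """Fail loud, not silent: raise the moment spend crosses the cap."""
    if spend_usd > cap_usd:
        raise RuntimeError(
            f"budget cap exceeded: ${spend_usd:.2f} > ${cap_usd:.2f}")

check_cap(48.0, cap_from_baseline(40.0))    # fine: under the $60 cap
# check_cap(122.8, cap_from_baseline(40.0))  # raises RuntimeError
```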
Verify
- Post-rollback, daily spend returns to baseline.
- The cap fires an alert when spend exceeds the expected normal-day level.
Common findings
- Auto-eval on save was enabled and a new version triggered a 50-input judge run.
- A user's session was held open and the chat client was retrying every 2 seconds (see the detection sketch after this list).
- A new tool returned huge results, inflating the next turn's prompt.
- Memory extraction policy moved from batched to post-turn.
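The retry-loop finding above has a clear signature: requests landing on a near-fixed interval. A sketch for spotting it in one user's request timestamps; the 2-second interval, run length, and jitter tolerance are all assumptions:

```python
from datetime import timedelta

def looks_like_retry_loop(timestamps, interval=timedelta(seconds=2),
                          min_run=10, jitter=0.5):
    """True if `min_run` consecutive requests arrive roughly `interval`
    apart -- the signature of a client retrying on a timer."""
    ts = sorted(timestamps)
    run = 1  # current streak of evenly spaced requests
    for a, b in zip(ts, ts[1:]):
        if abs((b - a - interval).total_seconds()) <= jitter:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 1
    return False
```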
Next steps
- Set a budget cap.
- Trace a single turn for forensic detail on a specific spike.
