Recipes
Run an A/B eval suite
Build a golden set, define criteria, and compare two agent versions head to head.
The goal
A reproducible numeric comparison between two agent versions on a golden set, broken down by criterion.
Steps
Build a golden set.
/agent-evals -> "New golden set". Add inputs (paste, CSV import, or "include rated messages" to seed from real thumbs-ups). Aim for 30-100 inputs to start (a minimal sketch of such a set follows).
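A golden set is just a fixed list of inputs plus whatever signals you want to check against them. A minimal sketch of building one for CSV import, assuming hypothetical `input` and `expected_keyword` columns (the column names are illustrative, not the product's import schema):

```python
import csv

# Hypothetical golden set: fixed inputs plus the signal each one should carry.
golden_set = [
    {"input": "Summarise my last invoice", "expected_keyword": "invoice"},
    {"input": "Draft a reply to Dana",     "expected_keyword": "Dana"},
]

with open("golden_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_keyword"])
    writer.writeheader()
    writer.writerows(golden_set)
```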
Define criteria.
/eval-criteria -> "New criterion". For each:
- LLM-judge: "On 1-5, how concise is the reply?" Pick the judge model.
- Keyword match: "must contain the user's name".
- Cost / latency thresholds.
Save. Each criterion is reusable across runs. The sketch below shows roughly what these checks amount to.
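As a rough pure-Python illustration of the three criterion types: the LLM-judge is just a rubric prompt sent to the judge model, and the other two reduce to simple predicates. Every name and threshold here is hypothetical, not the product's schema:

```python
JUDGE_PROMPT = (
    "On a scale of 1-5, how concise is the reply?\n"
    "Reply: {reply}\n"
    "Answer with a single digit."
)  # sent to whichever judge model you picked

def keyword_match(reply: str, keyword: str) -> bool:
    # "Must contain the user's name" reduces to a containment check.
    return keyword.lower() in reply.lower()

def within_thresholds(cost_usd: float, latency_ms: float,
                      max_cost: float = 0.01, max_latency: float = 2000) -> bool:
    # Cost / latency criteria are pass/fail gates (thresholds are examples).
    return cost_usd <= max_cost and latency_ms <= max_latency
```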
Run A/B.
On the agent, open the A/B Evals tab -> "New run". Pick the versions (current + canary), the golden set, and the criteria. Click Run.
The runtime fans the inputs across both versions in parallel. Progress streams live.
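Conceptually, the run is a parallel fan-out: every input is scored under both versions at once. A minimal asyncio sketch, where the hypothetical `score()` stands in for one agent call plus judging:

```python
import asyncio

async def score(version: str, text: str) -> float:
    # Stand-in for: call this agent version, then run the judges on the reply.
    await asyncio.sleep(0.1)  # simulated model + judge latency
    return 4.0 if version == "canary" else 3.5

async def run_ab(versions: list[str], inputs: list[str]) -> dict:
    # One task per (version, input) pair, all in flight at the same time.
    tasks = {
        (v, i): asyncio.create_task(score(v, text))
        for v in versions
        for i, text in enumerate(inputs)
    }
    await asyncio.gather(*tasks.values())
    return {key: task.result() for key, task in tasks.items()}

scores = asyncio.run(run_ab(["current", "canary"], ["input one", "input two"]))
```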
Read the result.
- Summary: overall winner per criterion, total delta.
- Per-input: rows sorted by absolute delta (sketched after this list). Click to see both responses side by side.
- Cost: per-version spend on the run.
A win on the criteria you care about, no regression on the others, no large cost delta -> promote.
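The per-input sort is just ordering by absolute delta so the largest disagreements surface first. A standalone sketch with made-up scores keyed by (version, input index):

```python
# Hypothetical per-version scores on the same golden set.
scores = {("current", 0): 3.5, ("canary", 0): 4.0,
          ("current", 1): 4.2, ("canary", 1): 3.9}
inputs = ["Summarise my last invoice", "Draft a reply to Dana"]

# Per-input delta (canary minus current), biggest swings first.
deltas = sorted(
    ((i, scores[("canary", i)] - scores[("current", i)]) for i in range(len(inputs))),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for i, delta in deltas:
    print(f"{inputs[i]!r}: {delta:+.2f}")
```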
Verify
- The run produces deterministic-ish results when re-run (small judge variance).
- Per-input rows match the agent's behaviour when chatted manually.
- The eval cost matches the per-call expectation:
  total_calls = inputs * versions * (model_calls + judge_calls).
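For example, assuming 50 inputs, 2 versions, 1 model call and 3 LLM-judged criteria per response:

```python
inputs, versions = 50, 2
model_calls, judge_calls = 1, 3  # per input per version (3 judged criteria assumed)
total_calls = inputs * versions * (model_calls + judge_calls)
print(total_calls)  # 400
```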
Tips
- Lower agent + judge temperature for tighter comparisons.
- For pure prompt comparison, clone the agent twice with retrieval and tools off.
Next steps
- Version, canary, and roll back once the eval clears.
- Debug a sudden cost spike if the eval reveals an expensive criterion.
