
Run an A/B eval suite

Build a fixed input set, score it under both versions, read the delta.

The goal

A reproducible numeric comparison between two agent versions on a golden set, broken down by criterion.

Steps

  1. Build a golden set.

    /agent-evals -> "New golden set". Add inputs (paste, CSV import, or "include rated messages" to seed from real thumbs-ups). Aim for 30-100 inputs to start.
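
    If you prefer to prepare inputs outside the UI, a plain one-column CSV works for the import option. A minimal sketch (the `input` header and file name are assumptions; match whatever the import dialog expects):

    ```python
    import csv

    # Hypothetical golden-set inputs; in practice, seed them from real
    # conversations (thumbs-up messages) plus hand-written edge cases.
    inputs = [
        "Cancel my subscription but keep my data.",
        "What's the difference between the Pro and Team plans?",
        "Summarise my last invoice in one sentence.",
    ]

    # One-column CSV for the "CSV import" option in the golden-set dialog.
    with open("golden_set.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input"])                    # assumed header name
        writer.writerows([[text] for text in inputs])
    ```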

  2. Define criteria.

    /eval-criteria -> "New criterion". For each:

    • LLM judge: "On a scale of 1-5, how concise is the reply?" Pick the judge model.
    • Keyword match: "must contain the user's name".
    • Cost / latency thresholds.

    Save. Each criterion is reusable across runs.
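
    Conceptually, each criterion reduces to a scoring rule applied to a response. A rough sketch of the three types above as plain data (field names and values are illustrative, not the Platos schema):

    ```python
    # Illustrative criterion definitions; the real ones live in /eval-criteria.
    criteria = [
        {   # LLM judge: a judge model scores the reply against a rubric
            "type": "llm_judge",
            "rubric": "On a scale of 1-5, how concise is the reply?",
            "judge_model": "judge-model-name",   # placeholder; pick yours in the UI
        },
        {   # Keyword match: hard pass/fail on required content
            "type": "keyword_match",
            "must_contain": ["{user_name}"],     # e.g. the user's name
        },
        {   # Thresholds: fail if a response exceeds either budget
            "type": "threshold",
            "max_cost_usd": 0.02,                # assumed budget values
            "max_latency_ms": 3000,
        },
    ]
    ```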

  3. Run A/B.

    On the agent, open the A/B Evals tab -> "New run". Pick the versions (current + canary), the golden set, and the criteria. Click Run.

    The runtime fans the inputs across both versions in parallel. Progress streams live.
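
    Under the hood this is an ordinary fan-out. A sketch of the idea (the runtime does this for you; `call_agent` is a stand-in, not a Platos API):

    ```python
    import asyncio

    async def call_agent(version: str, text: str) -> str:
        # Stand-in for the real agent call over the network.
        await asyncio.sleep(0)
        return f"[{version}] reply to: {text}"

    async def run_ab(inputs: list[str], version_a: str, version_b: str):
        # Send every input to both versions concurrently.
        tasks = [
            call_agent(version, text)
            for text in inputs
            for version in (version_a, version_b)
        ]
        replies = await asyncio.gather(*tasks)
        # Pair replies back up per input: (reply_from_a, reply_from_b)
        return list(zip(replies[0::2], replies[1::2]))

    pairs = asyncio.run(run_ab(["Cancel my subscription."], "current", "canary"))
    ```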

  4. Read the result.

    • Summary: overall winner per criterion, total delta.
    • Per-input: rows sorted by absolute delta. Click to see both responses side-by-side.
    • Cost: per-version spend on the run.

    A win on the criteria you care about, no regression on the others, no large cost delta -> promote.
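
    The numbers behind those views are easy to recompute if you export the scores. A sketch for a single criterion, assuming one score per input and version (the data here is made up):

    ```python
    # scores[input_id][version] = this criterion's score, e.g. a 1-5 judge rating
    scores = {
        "q1": {"current": 4, "canary": 5},
        "q2": {"current": 3, "canary": 2},
        "q3": {"current": 4, "canary": 4},
    }

    def mean(xs):
        return sum(xs) / len(xs)

    # Summary: per-version mean and the overall delta for this criterion.
    delta = mean([s["canary"] for s in scores.values()]) - mean(
        [s["current"] for s in scores.values()]
    )

    # Per-input: rows sorted by absolute delta, biggest disagreements first.
    rows = sorted(
        scores.items(),
        key=lambda kv: abs(kv[1]["canary"] - kv[1]["current"]),
        reverse=True,
    )
    ```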

Verify

  • Re-running the suite produces near-identical results (small variance from the judge model is expected).
  • Per-input rows match the agent's behaviour when chatted manually.
  • The eval cost matches the per-call expectation: inputs * versions * (model_calls + judge_calls); see the worked example below.
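
A quick worked example of that expectation: a 50-input set, 2 versions, 1 agent call and 2 judge calls per input should show roughly 300 calls on the run. The per-call prices below are made up; substitute your own.

```python
inputs, versions = 50, 2
model_calls, judge_calls = 1, 2          # per input, per version

total_calls = inputs * versions * (model_calls + judge_calls)
# Assumed prices: $0.01 per agent call, $0.002 per judge call.
expected_cost = inputs * versions * (model_calls * 0.01 + judge_calls * 0.002)
print(total_calls, round(expected_cost, 2))   # 300 1.4
```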

Tips

  • Lower agent + judge temperature for tighter comparisons.
  • For pure prompt comparison, clone the agent twice with retrieval and tools off.
