Chapter 4 · Evaluate AI Systems

In one minute

Chapter 3 gave you the tools; this chapter shows you how to turn them into a reliable, systematic evaluation pipeline for your application. That means choosing the right criteria, deciding how to pick and use public benchmarks/leaderboards (and why you can't fully trust them), and, most importantly, building your own evaluation set and pipeline that reflects what your users actually care about.

Step 1: Define what "good" means for your app

You can't evaluate until you know your evaluation criteria. The book groups them into themes:

  • Domain-specific capability: can it do the actual task (answer support questions, write SQL, etc.)?
  • Generation quality: is the output relevant, coherent, faithful, well-formatted?
  • Instruction-following: does it obey constraints (length, format, role, "only answer from the context")?
  • Factual consistency / groundedness: is it true, and supported by the provided context?
  • Safety: does it avoid harmful, biased, or policy-violating output?
  • Cost & latency: fast and cheap enough to ship.

Turn fuzzy goals into measurable ones

"Be helpful" isn't measurable. "Answers the user's question using only the retrieved documents, in under 150 words, with no fabricated facts" is. Write criteria you could hand to a grader (human or AI).
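One way to make this concrete is to write each criterion down as a yes/no question, and attach an automatic check wherever one is possible. The sketch below is only an illustration of that idea: the `Criterion` class, the `triage` helper, and the criterion names are hypothetical, not something the book prescribes. Only the length limit is checked mechanically; groundedness and fabrication are left as questions for a human or AI grader.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union


@dataclass
class Criterion:
    name: str
    # Yes/no question you could hand to a human or AI grader.
    grader_question: str
    # Optional automatic check; None means the criterion needs a grader.
    auto_check: Optional[Callable[[str], bool]] = None


def under_word_limit(answer: str, limit: int = 150) -> bool:
    """True if the answer has fewer than `limit` whitespace-separated words."""
    return len(answer.split()) < limit


# Hypothetical breakdown of the example criterion from the text above.
CRITERIA = [
    Criterion("length", "Is the answer under 150 words?", under_word_limit),
    Criterion("grounded", "Is every claim supported by the retrieved documents?"),
    Criterion("no_fabrication", "Is the answer free of fabricated facts?"),
]


def triage(answer: str) -> Dict[str, Union[bool, str]]:
    """Run the automatic checks and mark everything else as needing a grader."""
    results: Dict[str, Union[bool, str]] = {}
    for criterion in CRITERIA:
        if criterion.auto_check is not None:
            results[criterion.name] = criterion.auto_check(answer)
        else:
            results[criterion.name] = "needs_grader"
    return results
```

Splitting criteria this way keeps the cheap checks cheap and makes it explicit which parts of "good" still depend on judgment.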

Step 2: Use public benchmarks & leaderboards, skeptically

Benchmarks (MMLU, etc.) and leaderboards help you shortlist models, but come with traps:

  • Data contamination: benchmark questions may have leaked into training data, inflating scores.
  • Benchmarks saturate: once everyone tops them, they stop distinguishing models.
  • They rarely match your task: a high MMLU score doesn't mean it's good at your support tickets.
  • Aggregate leaderboards hide weaknesses relevant to you.
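For the contamination trap, a rough probe is n-gram overlap between a benchmark item and text you suspect was used for training. This is only a sketch under strong assumptions: you rarely have access to a model's training data, whitespace tokenization is crude, and the function names and the default `n = 8` are illustrative choices rather than a standard.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All word-level n-grams of the lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_sample: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the corpus sample."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = item_grams & ngrams(corpus_sample, n)
    return len(shared) / len(item_grams)
```

A high ratio suggests the item may have been seen verbatim; a low ratio proves little, since paraphrased or translated leaks would not be caught.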

Public benchmarks answer "is this model generally capable?", not "is it good for my use case?"

Public benchmarks = filter, not verdict

Use them to narrow dozens of models down to a few candidates. Then evaluate those candidates on your own data.

Step 3: Select a model (the practical funnel)

A sane model-selection process:

  1. Hard filters: license, privacy/compliance, modality, context length, hosting (API vs. self-host), cost ceiling.
  2. Public reputation: benchmarks/leaderboards to shortlist.
  3. Your own evaluation: run candidates on your task-specific eval set.
  4. Cost/latency/ops: can you actually run it at your scale and budget?
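The funnel above can be sketched as a small function. Everything here is a simplifying assumption made for illustration: the `Candidate` fields, the use of a single public score, the flat cost model, and `run_own_eval`, which stands in for whatever task-specific evaluation you run on your own data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    name: str
    license_ok: bool
    context_length: int
    cost_per_1k_requests: float  # hypothetical flat cost model
    public_score: float          # e.g. a leaderboard score


def select_model(
    candidates: List[Candidate],
    min_context: int,
    shortlist_size: int,
    run_own_eval: Callable[[Candidate], float],
    monthly_requests: int,
    monthly_budget: float,
) -> Optional[Candidate]:
    # 1. Hard filters: drop anything that fails a non-negotiable requirement.
    viable = [
        c for c in candidates
        if c.license_ok and c.context_length >= min_context
    ]
    # 2. Public reputation: keep only the top few by public score.
    shortlist = sorted(viable, key=lambda c: c.public_score, reverse=True)
    shortlist = shortlist[:shortlist_size]
    # 3. Your own evaluation: score each shortlisted model on your data.
    scored = [(run_own_eval(c), c) for c in shortlist]
    # 4. Cost/ops: keep only models whose projected monthly cost fits the budget.
    affordable = [
        (score, c) for score, c in scored
        if c.cost_per_1k_requests * monthly_requests / 1000 <= monthly_budget
    ]
    if not affordable:
        return None
    # Among affordable models, the best score on your own eval wins.
    return max(affordable, key=lambda pair: pair[0])[1]
```

Note that the public score only decides who gets evaluated; the final choice depends on your own eval score and on whether you can afford to run the model.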

A recurring decision along the way is commercial API vs. open-weight self-hosted model, which trades convenience and capability against control, privacy, and cost.

Step 4: Build YOUR evaluation pipeline

This is the heart of the chapter: a repeatable system, not a one-off check.

  1. Build an evaluation set. Collect representative, real examples (from logs, users, or curated cases). Include hard cases, edge cases, and known failure modes. Keep it versioned.
  2. Pick methods per criterion. Exact match for structured outputs, AI-judge for open text, functional tests for code, human review for the highest-stakes slices.
  3. Define a scoring rubric and, for AI judges, the exact judge model + prompt.
  4. Automate it. Run the pipeline on every meaningful change (new prompt, new model, new RAG setting) so you get fast, comparable feedback.
  5. Track results over time and watch for regressions.
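The steps above can be sketched as a very small harness. This is a minimal illustration, not a full pipeline: `Example`, `run_eval`, and `regressions` are hypothetical names, `generate` stands in for your application, and the AI judge is simply a scoring function you supply under the `"judge"` key.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Bump this whenever the eval set changes, so results stay comparable.
EVAL_SET_VERSION = "v1"


@dataclass
class Example:
    id: str
    input: str
    expected: str
    method: str  # which scorer to use, e.g. "exact_match" or "judge"


def exact_match(output: str, expected: str) -> float:
    """1.0 if the output equals the expected text (ignoring outer whitespace)."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(
    examples: List[Example],
    generate: Callable[[str], str],
    scorers: Dict[str, Callable[[str, str], float]],
) -> Dict[str, float]:
    """Score every example with the method it names; returns id -> score."""
    return {
        ex.id: scorers[ex.method](generate(ex.input), ex.expected)
        for ex in examples
    }


def regressions(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Ids whose score dropped by more than `tolerance` versus the baseline."""
    return sorted(
        ex_id for ex_id, score in current.items()
        if ex_id in baseline and score < baseline[ex_id] - tolerance
    )
```

Running `run_eval` on every prompt, model, or RAG change and comparing against the last accepted results with `regressions` gives per-example feedback rather than a single average that can hide a broken slice.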

Keep your eval set private and fresh

If your eval set leaks into prompts or finetuning data, it stops measuring anything. Rotate in new real-world examples regularly so it keeps reflecting actual usage.

Step 5: Evaluation in production

Offline evaluation isn't enough, because reality drifts.

  • Log everything (inputs, outputs, context, user signals).
  • Online metrics & A/B tests: measure real user outcomes, not just offline scores.
  • Guardrails & monitoring: catch unsafe or low-quality outputs live.
  • Close the loop: feed production failures back into your eval set (links to Ch 10's feedback loop).
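The logging and close-the-loop bullets can be sketched together. The record fields and the use of a thumbs-down as the failure signal are assumptions made for illustration; your application may have different signals (edits, retries, escalations). The candidates returned here are meant for human review before anything is added to the eval set.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Interaction:
    request_id: str
    user_input: str
    context: str
    output: str
    thumbs_up: Optional[bool] = None  # None when the user gave no feedback


def to_log_line(interaction: Interaction) -> str:
    """Serialize one interaction as a single JSON line."""
    return json.dumps(asdict(interaction), sort_keys=True)


def failures_to_eval_candidates(
    log_lines: Iterable[str],
    existing_inputs: Iterable[str],
) -> List[Dict[str, str]]:
    """Collect thumbs-down interactions whose input is not yet in the eval set."""
    seen = set(existing_inputs)
    candidates: List[Dict[str, str]] = []
    for line in log_lines:
        record = json.loads(line)
        if record["thumbs_up"] is False and record["user_input"] not in seen:
            seen.add(record["user_input"])  # avoid duplicate candidates
            candidates.append({
                "input": record["user_input"],
                "context": record["context"],
                "bad_output": record["output"],
            })
    return candidates
```

Keeping the bad output alongside the input makes it easier for a reviewer to write the expected answer and to tag the failure mode.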

How much evaluation is enough?

Evaluation has a cost too. The book's pragmatic stance: match evaluation rigor to stakes. A throwaway internal tool needs less than a medical or financial product. Start with a lightweight pipeline, then deepen it where the risk (and the cost of being wrong) is highest.

Takeaways

  • First define measurable criteria tied to user value, not vague goals.
  • Public benchmarks shortlist; your own data decides. Beware contamination and saturation.
  • Selecting a model is a funnel: hard filters → reputation → your eval → cost/ops.
  • Build a versioned eval set + automated pipeline and run it on every change.
  • Evaluation continues in production: log, monitor, A/B test, and feed failures back.

My personal learning notes from "AI Engineering" by Chip Huyen (O'Reilly, 2025). Shared for learning purposes; please buy the book.