Chapter 3 · Evaluation Methodology
In one minute
Evaluation is often called the hardest challenge in AI engineering, which is why it gets two chapters. The problem: foundation models produce open-ended, free-form output, there is often no single right answer, and the output for the same input can vary from run to run. This chapter builds your evaluation toolbox: how language models are scored under the hood (perplexity, cross-entropy), the families of evaluation methods (exact, lexical, semantic), and the increasingly popular "AI as a judge."
Why evaluation is so hard
- No ground truth. "Summarize this article" has many good answers, not one.
- Open-ended text can't be checked with a simple ==.
- Non-determinism. Sampling means outputs vary run to run.
- Capabilities are broad. A general model can be tested on endless tasks, so picking what to measure is itself hard.
- Failure is subtle. A fluent, confident answer can still be wrong (hallucination).
If you remember one thing
Most failed AI projects don't fail because the model is bad; they fail because the team couldn't reliably tell whether changes made things better or worse. Build evaluation first.
Language modeling metrics (the "under the hood" scores)
These measure how well a model predicts text. They're useful for training/monitoring, less so for end-user quality.
- Cross-entropy: how surprised the model is by the actual next token. Lower = better predictions.
- Perplexity: the exponential of cross-entropy; intuitively, how many options the model is effectively choosing between at each step. Lower perplexity = the model is more confident about the text, which usually means the text is more predictable or structured.
- Bits-per-character / bits-per-byte: tokenizer-independent variants.
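The relationship between these three scores can be sketched in a few lines. This is a minimal illustration, assuming you already have the probability the model assigned to each actual next token; the function names and the sample probabilities are made up for the example, not taken from any library.

```python
import math

def cross_entropy(token_probs):
    """Average negative log-probability (in nats) assigned to the actual tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Exponential of cross-entropy: the effective number of choices per step."""
    return math.exp(cross_entropy(token_probs))

def bits_per_byte(token_probs, text):
    """Total bits spent on the text divided by its UTF-8 length.

    Dividing by bytes instead of tokens removes the dependence on the tokenizer.
    """
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / len(text.encode("utf-8"))

# Hypothetical probabilities: the model gave each actual token a 25% chance.
uniform_over_four = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_over_four))  # about 4: like choosing among 4 options

# A more confident model yields lower cross-entropy and lower perplexity.
confident = [0.9, 0.8, 0.95, 0.85]
print(cross_entropy(confident), perplexity(confident))
```

A model that always spreads its probability evenly over four candidates has a perplexity of about 4, which is where the "number of options" intuition comes from.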
Use perplexity wisely
Low perplexity means the text is predictable to the model, not necessarily correct or useful. It's a great signal for data quality and detecting unusual inputs, not a substitute for task evaluation.
How to compare two outputs: the method families
When you need to score a generated answer, you pick from increasingly flexible methods:
- Exact match: does the output equal the reference exactly? Great for math answers, classification labels, structured outputs; useless for open text.
- Lexical similarity: overlap of words/phrases (e.g., BLEU, ROUGE). Cheap, but rewards surface word-matching and misses meaning.
- Semantic similarity: compare embeddings (vector representations) so "fast car" ≈ "speedy automobile." Captures meaning, but depends on the embedding model's quality.
Each is a trade-off between cost and how well it captures real quality.
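The three families can be compared side by side in a small sketch. Several things here are assumptions for illustration: the token-overlap F1 is a simplified stand-in for lexical metrics like BLEU and ROUGE (not an implementation of either), and the two embedding vectors are hand-picked toy values rather than output from a real embedding model.

```python
import math
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def exact_match(output, reference):
    return normalize(output) == normalize(reference)

def token_f1(output, reference):
    """Unigram-overlap F1: a simplified lexical similarity score."""
    out_tokens = normalize(output).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Exact match works for short, closed answers.
print(exact_match("  Paris ", "paris"))

# Lexical overlap sees no shared words, so it scores a paraphrase as zero.
print(token_f1("fast car", "speedy automobile"))

# Toy (hypothetical) embeddings that place the two phrases close together.
fast_car = [0.9, 0.1]
speedy_automobile = [0.8, 0.2]
print(cosine_similarity(fast_car, speedy_automobile))
```

The paraphrase pair shows the trade-off: the cheap lexical score misses the match entirely, while the semantic score is only as good as the vectors behind it.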
AI as a judge (LLM-as-a-judge)
Since human evaluation is slow and expensive, the field increasingly uses a strong model to grade outputs. This is one of the most practical ideas in the book.
Why it's powerful
- Fast, cheap, and scalable compared to humans.
- Flexible: you can ask it to grade anything with a rubric (helpfulness, correctness, tone, safety).
- Can give explanations, not just scores.
How it's used
- Score a single output against a rubric (e.g., 1–5 for relevance).
- Compare two outputs (pairwise) and pick the winner.
- Reference-based or reference-free grading.
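Rubric scoring, with or without a reference, might look roughly like the sketch below. Everything in it is hypothetical: `judge` stands for any callable that takes a prompt string and returns the judge model's reply, the prompt wording and the `SCORE:` reply format are illustrative choices, and `fake_judge` is a canned stand-in so the example runs without a model.

```python
import re

def build_rubric_prompt(question, answer, reference=None):
    """Build a grading prompt; the reference is included only when provided."""
    parts = [
        "You are grading an answer for relevance on a 1-5 scale.",
        "5 = fully answers the question, 1 = off topic.",
        f"Question: {question}",
        f"Answer: {answer}",
    ]
    if reference is not None:
        # Reference-based grading: the judge compares against a known good answer.
        parts.append(f"Reference answer: {reference}")
    parts.append("Give a one-sentence explanation, then 'SCORE: <1-5>' on the last line.")
    return "\n".join(parts)

def parse_score(reply):
    """Extract the 1-5 score; return None if the judge ignored the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None

def grade(judge, question, answer, reference=None):
    return parse_score(judge(build_rubric_prompt(question, answer, reference)))

def fake_judge(prompt):
    # Canned reply standing in for a real model call.
    return "The answer addresses the question directly.\nSCORE: 4"

print(grade(fake_judge, "What is the capital of France?", "Paris"))
```

Returning `None` for an unparseable reply, instead of guessing a score, keeps judge formatting failures visible in your results.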
The catches (be careful!)
- It's inconsistent: the judge is itself a probabilistic model.
- It's biased: it prefers longer answers, prefers its own style/outputs (self-bias), and is sensitive to option order (position bias).
- Not a fixed standard: if the judge model changes, your scores shift, breaking comparisons over time.
- Circularity: using a model to judge a model can hide shared blind spots.
Make AI judges reliable
Use clear rubrics, few-shot examples, pairwise comparisons (more robust than absolute scores), control for position/length bias, and spot-check against humans. Pin the judge model + prompt as part of your eval so results stay comparable.
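One way to control for position bias and to pin the judge setup is sketched below. It assumes a hypothetical `judge(question, first, second)` callable that returns `"first"` or `"second"`; the config values (`example-judge-v1` and so on) are placeholders, not real model names.

```python
import hashlib
import json

def pairwise_verdict(judge, question, answer_a, answer_b):
    """Ask the judge twice with the order swapped; only a consistent pick wins."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # the verdict flipped with the order, so don't trust it

def config_fingerprint(config):
    """Short stable hash of the judge setup, to store next to every score."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

# Placeholder judge configuration (all values are hypothetical).
JUDGE_CONFIG = {
    "model": "example-judge-v1",
    "prompt_version": "pairwise-rubric-1",
    "temperature": 0,
}

def position_biased_judge(question, first, second):
    # Stand-in judge that always prefers whatever is shown first.
    return "first"

print(pairwise_verdict(position_biased_judge, "Q?", "answer one", "answer two"))
print(config_fingerprint(JUDGE_CONFIG))
```

A judge that always picks the first option ends up with a tie under this scheme, so pure position bias cannot decide a comparison. Scores recorded with different fingerprints should not be compared directly.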
Other evaluation approaches
- Specialized scorers / reward models trained to predict quality or human preference.
- Functional correctness: for code, run it and test it; for tool use, check the action succeeded. The gold standard when you can get it.
- Comparative evaluation: rank models against each other (the basis for leaderboards / Elo-style ratings).
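Two of these approaches are easy to sketch. The first function checks functional correctness by running generated code against test cases; the second applies the standard Elo update after one head-to-head comparison. The function names, the K-factor of 32, and the sample inputs are assumptions for illustration, and the use of `exec` here assumes trusted input: real generated code should run in a sandbox.

```python
def passes_tests(source, func_name, test_cases):
    """Run generated code and check it against (args, expected) pairs.

    WARNING: exec runs arbitrary code; sandbox anything a model wrote.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # A crash or a missing function counts as a failure.
        return False

def elo_update(rating_a, rating_b, score_a, k=32):
    """Update two ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

generated = "def add(a, b):\n    return a + b"
tests = [((1, 2), 3), ((-1, 1), 0)]
print(passes_tests(generated, "add", tests))

# Two models start equal; model A wins one pairwise comparison.
print(elo_update(1000, 1000, 1.0))
```

Between equally rated models the expected score is 0.5, so a win moves each rating by half the K-factor in opposite directions.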
Takeaways
- Evaluation is hard because outputs are open-ended and non-deterministic, and it's where most projects quietly fail.
- Perplexity/cross-entropy measure prediction quality, not usefulness.
- Scoring methods trade cost for fidelity: exact → lexical → semantic.
- AI-as-a-judge is the scalable workhorse, but it's biased and inconsistent, so engineer it carefully.
- Prefer functional correctness and pairwise comparison whenever you can.