Cheat Sheet

A one-page refresher of the whole book. Use it to jog your memory or decide your next move.

The mental model

Improve a model's answer using four levers, cheapest first:

Golden rule: start simple, escalate only when the simpler lever runs out. Prompt → RAG → Finetune.

If you need to…	Use	Chapter
Change tone/format/behavior quickly	Prompting	5
Give the model facts/knowledge it lacks	RAG	6
Have the model do things (multi-step, tools)	Agents	6
Teach a durable skill or shrink to a specialist	Finetuning (LoRA/QLoRA)	7
Fix "answers are wrong/made up"	Grounding (RAG) + lower temperature + eval	2,3,6
Make it cheaper/faster	Quantization, distillation, batching, caching	9
Know if any change actually helped	Evaluation pipeline	3,4
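
The advice in the table to lower the temperature when answers are made up can be illustrated with a small worked example. This is a minimal sketch using only the standard library; the logits and the function name `softmax_with_temperature` are hypothetical and not taken from the book. It shows that dividing logits by a smaller temperature concentrates probability on the top candidate, which makes sampling more predictable.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.5]
warm = softmax_with_temperature(logits, temperature=1.0)
cool = softmax_with_temperature(logits, temperature=0.3)

print("T=1.0:", [round(p, 3) for p in warm])
print("T=0.3:", [round(p, 3) for p in cool])
```

At the lower temperature the first token takes most of the probability mass, so the model is less likely to sample an unlikely continuation. Lower temperature reduces randomness; it does not by itself make an answer correct, which is why the table pairs it with grounding and evaluation.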

RAG vs. finetuning

RAG = give the model knowledge → use when info is missing, changing, or private.
Finetuning = teach the model a skill/behavior → use for format, tone, style, or specialization.
They're complementary. RAG is usually cheaper and easier, so try it first.
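
The "give the model knowledge" half of this comparison can be sketched in a few lines. The example below is a toy, assuming a word-overlap retriever in place of a real search index or embedding model; the documents, the `retrieve` and `build_prompt` helpers, and the prompt wording are all hypothetical. The point is only the shape of the technique: fetch relevant text, then place it in the prompt.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (toy lexical retriever)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]  # drop non-matches
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(query, documents, k=2):
    """Assemble a grounded prompt from the top-k retrieved documents."""
    context = retrieve(query, documents, k)
    context_block = "\n".join(f"- {doc}" for doc in context)
    if not context_block:
        context_block = "- (no relevant context found)"
    return (
        "Answer using only the context below. "
        "If the answer is not there, say so.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}"
    )


# Hypothetical private knowledge the base model would not have.
docs = [
    "Refunds are accepted within 30 days of purchase.",
    "Support is available on weekdays from 9am to 5pm.",
    "Shipping to Europe takes 5 to 7 business days.",
]
prompt = build_prompt("Are refunds accepted after purchase?", docs, k=1)
print(prompt)
```

Nothing about the model changes here, which is why RAG suits information that is missing, changing, or private: updating the answer means updating `docs`, not retraining.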

Evaluation checklist

[ ] Defined measurable criteria tied to user value
[ ] Built a versioned eval set from real examples (incl. edge cases)
[ ] Chose methods per criterion (exact / lexical / semantic / AI judge / functional)
[ ] Controlled AI-judge bias (pairwise, rubrics, pinned model)
[ ] Automated the pipeline; run on every change
[ ] Decontaminated eval data from training/prompt data
[ ] Monitoring + A/B tests in production
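
Several items on this checklist (a versioned eval set, a method chosen per criterion, an automated run) can be sketched together. The example below is a minimal illustration under stated assumptions: the eval examples, the version tag, and `stub_model` are hypothetical stand-ins, and only two of the listed methods (exact match and a lexical token-overlap F1) are shown.

```python
import re
from collections import Counter

EVAL_SET_VERSION = "v1"  # hypothetical version tag; bump when examples change
EVAL_SET = [
    {"input": "2 + 2", "expected": "4", "method": "exact"},
    {
        "input": "Capital of France?",
        "expected": "Paris is the capital of France",
        "method": "lexical",
    },
]


def exact_match(output, expected):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def token_f1(output, expected):
    """Lexical similarity: F1 over the words shared by output and reference."""
    out = re.findall(r"\w+", output.lower())
    exp = re.findall(r"\w+", expected.lower())
    overlap = sum((Counter(out) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


METHODS = {"exact": exact_match, "lexical": token_f1}


def run_eval(model_fn, eval_set=EVAL_SET):
    """Score every example with its assigned method and report the mean."""
    scores = [
        METHODS[ex["method"]](model_fn(ex["input"]), ex["expected"])
        for ex in eval_set
    ]
    return {
        "version": EVAL_SET_VERSION,
        "n": len(scores),
        "mean_score": sum(scores) / len(scores),
    }


def stub_model(prompt):
    """Stand-in for a real model call (assumption for this sketch)."""
    canned = {"2 + 2": "4", "Capital of France?": "The capital of France is Paris"}
    return canned.get(prompt, "")


report = run_eval(stub_model)
print(report)
```

Running `run_eval` on every prompt or model change, and comparing `mean_score` across runs of the same eval-set version, is one way to tell whether a change actually helped.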

Inference optimization

Model level: quantization (best ROI), distillation, pruning, MoE
Service level: continuous batching, KV/prefix cache, parallelism, speculative decoding
API vs. self-host: an API provider optimizes for you; self-hosting gives you control and economies of scale, but optimization becomes your responsibility
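
To make the quantization item concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration of the general idea, not the scheme any particular library uses; the weights and function names are hypothetical. Each float is stored as a small integer plus one shared scale, trading a bounded rounding error for a smaller memory footprint.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


# Hypothetical weights (illustrative values only).
weights = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("max round-trip error:", max_error)
```

The round-trip error per weight is at most half the scale, so a tensor whose values are tightly clustered loses very little precision. Real systems typically quantize per channel or per block and handle outliers separately.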

Architecture

Simple call → + Context (RAG/agents) → + Guardrails → + Router/Gateway → + Cache → + Orchestration → + Monitoring

…plus a user feedback loop (explicit + implicit signals) that feeds evaluation and finetuning data.
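
The layering above can be sketched as a single request path. The `Gateway` class below is a hypothetical toy, assuming stub models, an email-redaction guardrail, a word-count router, and an exact-match cache; none of these specifics come from the book. It only shows how the layers wrap a simple call, and how a log gives monitoring something to read.

```python
import re


class Gateway:
    """Toy request path: guardrail -> cache -> router -> model -> log."""

    def __init__(self, models, default_model):
        self.models = models  # name -> callable(prompt) -> str
        self.default_model = default_model
        self.cache = {}  # exact-match response cache
        self.log = []  # records that a monitoring system could consume

    def guard_input(self, prompt):
        """Input guardrail: redact email-like strings before they leave."""
        return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", prompt)

    def route(self, prompt):
        """Toy router: short prompts go to the cheaper model if present."""
        if len(prompt.split()) <= 8 and "small" in self.models:
            return "small"
        return self.default_model

    def handle(self, prompt):
        safe = self.guard_input(prompt)
        if safe in self.cache:
            self.log.append({"prompt": safe, "model": None, "cache_hit": True})
            return self.cache[safe]
        name = self.route(safe)
        answer = self.models[name](safe)
        self.cache[safe] = answer
        self.log.append({"prompt": safe, "model": name, "cache_hit": False})
        return answer


# Stub models standing in for real API calls (assumption for this sketch).
gateway = Gateway(
    models={"small": lambda p: f"small:{p}", "large": lambda p: f"large:{p}"},
    default_model="large",
)
first = gateway.handle("Email me at a@b.com")
second = gateway.handle("Email me at a@b.com")  # served from the cache
print(first)
print(gateway.log)
```

Each layer is optional and independent, which matches the step-by-step build-up: start with the plain model call and add a layer only when a concrete need (safety, cost, latency, visibility) appears.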

Chapter summaries

1. Building AI Apps: scale created foundation models & a new discipline; ask "should I build this?"
2. Foundation Models: data + architecture/scale + post-training shape behavior; sampling explains quirks.
3. Evaluation Methodology: perplexity, similarity, and AI-as-a-judge; evaluation is the hardest part.
4. Evaluating AI Systems: define criteria, distrust benchmarks, build your own eval pipeline.
5. Prompt Engineering: cheapest lever; best practices + prompt security.
6. RAG and Agents: give the model context (retrieval) and actions (tools).
7. Finetuning: change the model itself; LoRA/QLoRA make it affordable.
8. Dataset Engineering: data is the bottleneck; acquire, synthesize, clean, and measure quality.
9. Inference Optimization: make it faster/cheaper at model and service levels.
10. Architecture & Feedback: assemble the system; close the feedback loop to keep improving.