Skip to content

Cheat Sheet

A one-page refresher of the whole book. Use it to jog your memory or decide your next move.

The mental model

Improve a model's answer using four levers, cheapest first:

  1. Sampling settings (temperature, top-p), free (Ch 2)
  2. Prompt: instructions & examples (Ch 5)
  3. Context: RAG & agents (Ch 6)
  4. The model: finetuning & data (Ch 7–8)

Golden rule: start simple, escalate only when the simpler lever runs out. Prompt → RAG → Finetune.

Decision guide

If you need to…UseChapter
Change tone/format/behavior quicklyPrompting5
Give the model facts/knowledge it lacksRAG6
Have the model do things (multi-step, tools)Agents6
Teach a durable skill or shrink to a specialistFinetuning (LoRA/QLoRA)7
Fix "answers are wrong/made up"Grounding (RAG) + lower temp + eval2,3,6
Make it cheaper/fasterQuantization, distillation, batching, caching9
Know if any change actually helpedEvaluation pipeline3,4

RAG vs. Finetuning (the classic question)

  • RAG = give the model knowledge → use when info is missing, changing, or private.
  • Finetuning = teach the model a skill/behavior → use for format, tone, style, or specialization.
  • They're complementary. RAG is usually cheaper and easier, try it first.

Evaluation checklist

  • [ ] Defined measurable criteria tied to user value
  • [ ] Built a versioned eval set from real examples (incl. edge cases)
  • [ ] Chose methods per criterion (exact / lexical / semantic / AI judge / functional)
  • [ ] Controlled AI-judge bias (pairwise, rubrics, pinned model)
  • [ ] Automated the pipeline; run on every change
  • [ ] Decontaminated eval data from training/prompt data
  • [ ] Monitoring + A/B tests in production

Prompt engineering checklist

  • [ ] Clear, specific task + constraints + audience
  • [ ] Role/persona set in system prompt
  • [ ] Few-shot examples for tricky formats
  • [ ] Ask for structured output (with schema)
  • [ ] Decompose complex tasks; chain prompts
  • [ ] Allow "I don't know" to cut hallucinations
  • [ ] Important info at start/end (avoid lost-in-the-middle)
  • [ ] Version prompts; test against eval set

Security must-dos

  • Separate trusted instructions from untrusted input/external data
  • Filter inputs (injection/PII) and outputs (leaks/unsafe)
  • Least privilege for agents/tools; sandbox code; approve risky actions
  • Never put real secrets in the prompt; assume context can leak
  • Red-team and monitor continuously

Inference optimization toolbox

  • Model level: quantization (best ROI), distillation, pruning, MoE
  • Service level: continuous batching, KV/prefix cache, parallelism, speculative decoding
  • API vs. self-host: API optimizes for you; self-hosting = control + scale economics, but your responsibility

Production architecture build order

Simple call → + Context (RAG/agents) → + Guardrails → + Router/Gateway → + Cache → + Orchestration → + Monitoring

…plus a user feedback loop (explicit + implicit signals) that feeds evaluation and finetuning data.

The 10 chapters in one line each

  1. Building AI Apps: scale created foundation models & a new discipline; ask "should I build this?"
  2. Foundation Models: data + architecture/scale + post-training shape behavior; sampling explains quirks.
  3. Evaluation Methodology: perplexity, similarity, and AI-as-a-judge; evaluation is the hardest part.
  4. Evaluating AI Systems: define criteria, distrust benchmarks, build your own eval pipeline.
  5. Prompt Engineering: cheapest lever; best practices + prompt security.
  6. RAG and Agents: give the model context (retrieval) and actions (tools).
  7. Finetuning: change the model itself; LoRA/QLoRA make it affordable.
  8. Dataset Engineering: data is the bottleneck; acquire, synthesize, clean, and measure quality.
  9. Inference Optimization: make it faster/cheaper at model and service levels.
  10. Architecture & Feedback: assemble the system; close the feedback loop to keep improving.

My personal learning notes from "AI Engineering" by Chip Huyen (O'Reilly, 2025). Shared for learning purposes, please buy the book.