Chapter 7 · Finetuning
In one minute
Finetuning changes the model itself by training it further on your data. It is the most powerful quality lever, and also the most expensive. This chapter covers when to finetune (and when not to), the memory problem that makes finetuning hard at foundation-model scale, and the techniques that make it affordable: parameter-efficient finetuning (PEFT) methods such as LoRA, quantized variants, and the experimental idea of model merging. It also includes the math for estimating a model's memory footprint.
When to finetune, and when not to
Finetuning is not usually your first move. The book's guidance:
Reach for finetuning when:
- You need a specific behavior, format, tone, or skill that prompting can't reliably produce.
- You want a smaller, cheaper, faster model to match a big one on your narrow task.
- You have a domain/style poorly covered by general models.
- You have (or can create) good training data.
Avoid finetuning when:
- The problem is missing information → use RAG instead (finetuning is bad at injecting fresh facts).
- You haven't exhausted prompting.
- You lack quality data or the ability to evaluate the result.
- The base models are improving so fast your finetune would be obsolete soon.
Finetuning vs. RAG: the cleanest rule of thumb
RAG = give the model knowledge. Finetuning = teach the model a skill/behavior. Need it to know something → RAG. Need it to act/respond a certain way → finetune. They're complementary, not either/or.
What "finetuning" includes
- Supervised finetuning (SFT): train on input→output examples of the behavior you want.
- Preference finetuning: align the model to human preferences, using reinforcement learning from human feedback (RLHF) or the simpler direct preference optimization (DPO).
- Continued pre-training: more self-supervised training on domain text to shift the base knowledge/style.
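To make the three flavors concrete, here is a minimal sketch of what one training record might look like for each. The field names (`prompt`, `response`, `chosen`, `rejected`, `text`) and the `missing_fields` helper are illustrative assumptions, not a format the chapter prescribes; real training libraries each define their own schema.

```python
# Hypothetical record shapes for the three kinds of finetuning.
# Field names are assumptions for illustration only.
REQUIRED_FIELDS = {
    "sft": {"prompt", "response"},                  # input -> desired output
    "preference": {"prompt", "chosen", "rejected"}, # better vs. worse answer
    "continued_pretraining": {"text"},              # raw domain text, no labels
}


def missing_fields(kind, record):
    """Return the sorted list of required fields absent from a record."""
    return sorted(REQUIRED_FIELDS[kind] - set(record))


sft_record = {
    "prompt": "Summarize this ticket in one sentence: ...",
    "response": "Customer cannot reset their password.",
}
preference_record = {
    "prompt": "Explain LoRA briefly.",
    "chosen": "LoRA freezes the base weights and trains small adapters.",
    # "rejected" is deliberately left out to show the check below
}

print(missing_fields("sft", sft_record))                # []
print(missing_fields("preference", preference_record))  # ['rejected']
```

A cheap structural check like this is worth running before any training job, since malformed records are a common source of wasted runs.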
The big obstacle: memory
Foundation models are huge, and full finetuning (updating every parameter) needs memory for:
- the model weights,
- the gradients,
- the optimizer states (e.g., Adam keeps two extra values per parameter: running estimates of the gradient's first and second moments),
- and the activations.
This often costs several times the memory of just running the model. The chapter shows how to estimate the memory footprint from parameter count, numerical precision (FP32 vs FP16/BF16 vs INT8/INT4), and these overheads, so you can predict whether a finetune fits on your hardware.
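The estimate described above can be sketched in a few lines. This is a back-of-the-envelope calculation under stated assumptions, not the book's exact formula: it assumes one gradient value and two Adam state values per parameter, all stored at the same precision, and it leaves out activations because they depend on batch size and sequence length. The function name `full_finetune_memory_gb` is hypothetical.

```python
def full_finetune_memory_gb(num_params, bytes_per_value=2,
                            optimizer_states_per_param=2):
    """Rough memory breakdown for full finetuning, in GB (1 GB = 1e9 bytes).

    Assumptions: every parameter is trainable, has one gradient value and
    `optimizer_states_per_param` optimizer values (2 for Adam), and all
    values use `bytes_per_value` bytes. Activations are NOT included.
    """
    gb = 1e9
    weights = num_params * bytes_per_value / gb
    gradients = num_params * bytes_per_value / gb
    optimizer_states = (num_params * optimizer_states_per_param
                        * bytes_per_value / gb)
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_states": optimizer_states,
        "total_without_activations": weights + gradients + optimizer_states,
    }


# A 7-billion-parameter model at 2 bytes per value (FP16/BF16).
estimate = full_finetune_memory_gb(7_000_000_000)
print(estimate["weights"])                    # 14.0
print(estimate["total_without_activations"])  # 56.0
```

Under these assumptions, training needs about four times the memory of holding the weights alone, before activations are counted. Real setups can differ, for example when optimizer states are kept at higher precision than the weights.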
Precision = memory
Each parameter takes bytes according to its precision: FP32 = 4 bytes, FP16/BF16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 byte. Lower precision → less memory (and often faster), at some accuracy risk. This single idea drives most efficiency techniques.
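As a quick illustration of the bytes-per-parameter rule, the sketch below computes the memory needed just to hold a model's weights at each precision. The 13-billion-parameter size is an arbitrary example, and the helper name `weights_memory_gb` is hypothetical; the estimate ignores any per-block metadata that real quantization formats add.

```python
# Bytes per parameter, as listed in the text above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weights_memory_gb(num_params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("FP32", "BF16", "INT8", "INT4"):
    print(precision, weights_memory_gb(13_000_000_000, precision))
# FP32 52.0
# BF16 26.0
# INT8 13.0
# INT4 6.5
```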
PEFT: finetuning without the full cost
Parameter-efficient finetuning updates only a small fraction of parameters (or adds tiny new ones), slashing memory and storage while keeping most of the quality.
- LoRA (Low-Rank Adaptation): freeze the original weights and train small low-rank adapter matrices added alongside them. You train millions of parameters instead of billions.
- Tiny artifacts: a LoRA adapter is small and easy to store/share. You can keep many adapters for one base model and swap them per task.
- Composable & reversible: the base model stays intact, so removing an adapter restores the original behavior.
- QLoRA: LoRA on top of a quantized (e.g., 4-bit) base model, so you can finetune large models on a single GPU.
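The LoRA idea in the list above can be sketched in plain Python. For a frozen weight matrix W of shape d_out × d_in, LoRA trains two small matrices, A (rank × d_in) and B (d_out × rank), and adds their scaled product to the output. The `alpha / rank` scaling follows the original LoRA formulation; the function names and the toy numbers are assumptions for illustration, and real implementations use tensor libraries rather than nested lists.

```python
def lora_param_counts(d_in, d_out, rank):
    """Parameters in the full matrix vs. in the two LoRA adapter matrices."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# One 4096 x 4096 projection with rank-8 adapters.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, lora / full)  # 16777216 65536 0.00390625

# Toy forward pass: with B all zeros (a common starting point),
# the adapted layer reproduces the frozen layer exactly.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [3.0, 4.0]
print(lora_forward(W, A, [[0.0], [0.0]], x, alpha=2, rank=1))  # [3.0, 4.0]
print(lora_forward(W, A, [[1.0], [2.0]], x, alpha=2, rank=1))  # [17.0, 32.0]
```

In this example the adapter holds under 0.4% of the layer's parameters, which is why adapters are cheap to train, store, and swap.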
Default to PEFT
For most teams, LoRA/QLoRA is the practical way to finetune. Full finetuning is rarely worth its cost unless you have strong reasons and serious hardware.
Model merging (the experimental frontier)
Instead of training one model, combine multiple models/adapters into one:
- Merge finetuned variants (e.g., averaging weights, "task arithmetic") to blend skills without extra training.
- Useful for multi-task models or combining community finetunes.
- Powerful but experimental: results can be unpredictable, so evaluate carefully.
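The two merging approaches named above can be illustrated on toy data. In this sketch a "model" is just a dictionary mapping a parameter name to a flat list of numbers, which is an assumption made to keep the example self-contained. Weight averaging takes the element-wise mean; task arithmetic treats each finetune's difference from the base as a "task vector" and adds the scaled sum of those vectors back to the base. Function names and the `scale` parameter are illustrative.

```python
def average_weights(models):
    """Element-wise mean of the same parameters across several models."""
    merged = {}
    for name in models[0]:
        columns = zip(*(model[name] for model in models))
        merged[name] = [sum(values) / len(models) for values in columns]
    return merged


def task_arithmetic(base, finetuned_models, scale=1.0):
    """base + scale * sum over models of (finetuned - base)."""
    merged = {}
    for name, base_values in base.items():
        merged_values = list(base_values)
        for model in finetuned_models:
            for i, (ft, b) in enumerate(zip(model[name], base_values)):
                merged_values[i] += scale * (ft - b)
        merged[name] = merged_values
    return merged


base = {"w": [1.0, 1.0]}
model_a = {"w": [2.0, 1.0]}  # finetune that changed the first weight
model_b = {"w": [1.0, 3.0]}  # finetune that changed the second weight

print(average_weights([model_a, model_b]))        # {'w': [1.5, 2.0]}
print(task_arithmetic(base, [model_a, model_b]))  # {'w': [2.0, 3.0]}
```

Note how the two methods give different results on the same inputs: averaging dilutes each finetune's change, while task arithmetic keeps both changes at full strength. Neither outcome is guaranteed to behave well on real models, which is why the merged model needs its own evaluation.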
A practical finetuning workflow
- Confirm finetuning is the right lever (not prompting/RAG).
- Choose a base model (size, license, quality, hardware fit).
- Prepare high-quality data (this is the hard part; see Ch 8).
- Pick a method: usually LoRA/QLoRA.
- Set precision & hyperparameters to fit your memory budget.
- Train, then evaluate against your eval set (Ch 4). Compare to the un-finetuned baseline + RAG.
- Iterate on data quality first; it beats hyperparameter tweaking.
Takeaways
- Finetuning is the most powerful and most expensive lever; use it after prompting and RAG, not before.
- RAG for knowledge, finetuning for behavior/skill.
- The core constraint is memory; learn to estimate footprint from params × precision + overhead.
- PEFT (LoRA/QLoRA) makes finetuning affordable by training a tiny fraction of parameters.
- Model merging is a promising but experimental way to combine capabilities.
- Success depends far more on data quality than on clever training tricks.