Chapter 2 · Understanding Foundation Models
In one minute
You don't need to train foundation models, but understanding how they're made helps you pick the right one and get more out of it. This chapter covers the three big design decisions that shape a model's behavior: training data, architecture and scale, and post-training (alignment). It then explains how a model generates text, which demystifies odd behaviors like inconsistency and hallucination and reveals sampling settings as a cheap, powerful lever.
Design decision 1: Training data
A model can only be as good as the data it learned from. Key ideas:
- Garbage in, garbage out. The mix, quality, and breadth of training data determine what the model knows and how it's biased.
- Web-scraped data such as Common Crawl dominates the training mix of general-purpose models, so data quality is uneven and skewed toward English and toward whatever is abundant online.
- Domain and language gaps are real: if your use case is low-resource (a niche domain or a less-common language), a general model may underperform. That is one reason you might seek a specialized model or finetune later.
Practical takeaway
Before adopting a model, ask: what was it trained on? A model strong on English web text may be weak on legal documents, your native language, or recent events.
Design decision 2: Architecture and scale
- The transformer is the dominant architecture. Its attention mechanism lets the model weigh which earlier tokens matter when producing the next one, which is what made long-range understanding work at scale. (Alternatives exist, but transformers remain the default.)
- Scale is described with three numbers:
  - Number of parameters: the model's "size"/capacity.
  - Training data size: how many tokens it saw.
  - Compute (FLOPs): how much calculation went into training.
- Scaling laws tell us how to spend a compute budget: bigger models and more data, balanced. (The Chinchilla insight: many early models were too big for how little data they were trained on.)
- The bottleneck is shifting from compute to data: we risk running out of high-quality public text.
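The attention idea in the list above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention for a single query, not a full transformer layer: the function names and the toy vectors are made up for this example, and real models add learned projections, multiple heads, and masking.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over earlier tokens."""
    d = len(query)
    # Similarity of the query to each earlier token's key, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is a weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors, invented purely for illustration
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The weights show which earlier tokens the query "pays attention to": tokens whose keys align with the query get a larger share of the output.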
Parameters ≠ everything
A bigger model isn't automatically better for you. A smaller model that is well trained or finetuned can beat a giant one on a specific task while being far cheaper and faster.
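To make the three scale numbers concrete, here is a back-of-the-envelope sketch. It relies on two widely used rules of thumb that are assumptions here, not statements from this chapter: training compute is roughly 6 FLOPs per parameter per token, and the Chinchilla result is often summarized as about 20 training tokens per parameter. The two models below are hypothetical.

```python
def training_flops(params, tokens):
    """Rough training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

def chinchilla_tokens(params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token count (~20 tokens per parameter)."""
    return params * tokens_per_param

# Hypothetical models, not real systems
models = {
    "small": {"params": 7e9, "tokens": 2e12},
    "large": {"params": 70e9, "tokens": 3e11},
}

for name, m in models.items():
    flops = training_flops(m["params"], m["tokens"])
    ratio = m["tokens"] / m["params"]
    suggested = chinchilla_tokens(m["params"])
    print(f"{name}: {flops:.2e} training FLOPs, "
          f"{ratio:.0f} tokens per parameter, "
          f"rule-of-thumb target {suggested:.1e} tokens")
```

Under these assumptions the hypothetical large model sees far fewer tokens per parameter than the rule of thumb suggests, which is the "too big for its data" pattern the Chinchilla insight describes.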
Design decision 3: Post-training (making it useful and aligned)
A freshly pre-trained model is good at predicting text, but not necessarily at being helpful or safe. Post-training fixes that, usually in two steps:
- Supervised finetuning (SFT): train on high-quality example conversations/instructions so the model learns to follow instructions and respond in the desired style.
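- Preference finetuning (RLHF and friends): humans rank outputs, and the model is tuned to prefer the responses people like. Techniques include RLHF (reinforcement learning from human feedback, which trains a reward model and then applies reinforcement learning) and simpler alternatives like DPO (direct preference optimization).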
This is why two models of similar size can feel very different: their post-training (and the values baked in) differ.
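As an illustration of what "tuned to prefer responses people like" can mean, here is a minimal sketch of the DPO loss for a single preference pair. The function and argument names are hypothetical, the log-probabilities are assumed to be the summed token log-probabilities of each whole response, and `beta=0.1` is only an illustrative value.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the log-probability of a full response under either
    the model being tuned (policy) or a frozen reference model.
    """
    # How much more the policy likes each response than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: no preference learned yet
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the human-preferred response
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the preferred response's probability relative to the rejected one, measured against the reference model, and no separate reward model is needed.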
How a model generates a response
This is the part that explains the model's "personality" and its failures.
- The model outputs a probability distribution over the next token, then a sampling step picks one. The chosen token is appended to the input and the process repeats, token by token.
- Sampling settings you can tune:
  - Temperature: higher = more random/creative, lower = more focused/deterministic.
  - Top-k: only consider the k most likely tokens.
  - Top-p (nucleus): consider the smallest set of most likely tokens whose cumulative probability reaches p.
  - Stop sequences, max tokens, logprobs, etc.
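The sampling settings above can be sketched as one small function. This is a simplified illustration that assumes the model's output is available as a dictionary of token logits; the function name, the toy logits, and the order in which the filters are applied are choices made for this example, and real inference libraries differ in the details.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from a dict of {token: logit}."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always the top token
        return max(logits, key=logits.get)

    # Temperature scaling followed by a numerically stable softmax
    scaled = {tok: value / temperature for tok, value in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    if top_k is not None:
        # Keep only the k most likely tokens (at least one)
        ranked = ranked[:max(1, top_k)]

    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    tokens = [tok for tok, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices treats the weights as relative, so no renormalizing is needed
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy logits, invented for illustration
logits = {"Paris": 5.0, "Lyon": 2.0, "London": 1.0, "banana": -3.0}
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```

Lowering `temperature` sharpens the distribution toward the top token, while `top_k` and `top_p` remove the unlikely tail before the random draw.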
The cheapest performance boost
Changing sampling settings costs nothing and can dramatically change output quality. Need consistent, factual answers? Lower the temperature. Need brainstorming? Raise it.
Why models are inconsistent
Because generation involves random sampling, the same prompt can give different answers. Two related problems:
- Same input, different outputs: mitigated by lowering the temperature or fixing a random seed where supported.
- Slightly different input, very different outputs: models can be sensitive to small prompt changes.
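The "same input, different outputs" case can be reproduced with a toy stand-in for a model. In this sketch a fixed next-token distribution replaces a real model, and the function name and probabilities are invented for illustration. It only shows why a fixed seed makes sampling repeatable.

```python
import random

def generate(probs, length, seed=None):
    """Sample `length` tokens from a fixed distribution (a toy stand-in for a model)."""
    rng = random.Random(seed)  # seed=None gives a fresh, unpredictable stream
    tokens = list(probs)
    weights = list(probs.values())
    return [rng.choices(tokens, weights=weights, k=1)[0] for _ in range(length)]

# Toy distribution, invented for illustration
next_token_probs = {"yes": 0.5, "no": 0.3, "maybe": 0.2}

a = generate(next_token_probs, 5, seed=42)
b = generate(next_token_probs, 5, seed=42)
print(a == b)  # True: the same seed replays the same random draws

c = generate(next_token_probs, 5)
d = generate(next_token_probs, 5)
print(c, d)  # unseeded runs may or may not match
```

A seed controls only the random draws, so it does not help with the second problem in the list, where a small change to the prompt changes the distribution itself.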
Why models hallucinate
A hallucination is a confident but false statement. The chapter frames two main explanations:
- The model can't tell what it knows from what it's plausibly guessing: it's optimized to produce likely-sounding text, not true text.
- Training can inadvertently teach the model to make things up (e.g., when finetuning data contains answers the base model couldn't have known, it learns that "producing an answer" is rewarded even without grounding).
Mitigations preview the rest of the book: grounding with context (RAG), better prompting, verification/evaluation, and lower-temperature decoding for factual tasks.
Takeaways
- A model is the product of three choices: data, architecture/scale, and post-training. Each leaves fingerprints on its behavior.
- Transformers + scaling laws explain why models got so capable; data is becoming the limiting resource.
- Post-training (SFT + preference tuning) is what turns a text predictor into a helpful assistant.
- Generation is probabilistic sampling: that's the root of both creativity and inconsistency/hallucination.
- Sampling settings are a free, powerful lever. Tune them before reaching for anything fancier.