Chapter 9 · Inference Optimization
In one minute
Chapters 5–8 made the model better; this chapter makes it cheaper and faster to run. Inference is the act of generating outputs from a trained model, and at scale it's where most of the cost and latency live. The chapter optimizes at two levels: the model level (quantization, distillation, etc.) and the inference-service level (batching, caching, parallelism). If you use a model API, the provider handles most of this; if you self-host, it's your job.
First: what are we optimizing?
Three metrics pull against each other, and you usually can't maximize all three:
- Latency: how fast the user gets a response. It has two components:
  - Time to first token (TTFT): how long until output starts (matters for streaming/chat).
  - Time per output token (TPOT): how fast tokens keep coming for a single request once output has started.
- Throughput: total tokens/requests served across all users (drives cost-efficiency).
- Cost: $ per request/token (compute, memory, hardware).
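To make these metrics concrete, here is a minimal sketch of how they could be computed from per-token arrival timestamps. The `RequestTrace` class, its field names, and the sample timings are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request (hypothetical structure)."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: average gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def total_latency(trace: RequestTrace) -> float:
    """End-to-end latency: from sending until the last token arrives."""
    return trace.token_times[-1] - trace.sent_at


def throughput(traces: List[RequestTrace]) -> float:
    """Fleet-wide output tokens per second across all requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    tokens = sum(len(t.token_times) for t in traces)
    return tokens / (end - start)


# Two made-up requests that overlap in time.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5]),
    RequestTrace(sent_at=0.5, token_times=[1.5, 1.75, 2.0, 2.25, 2.5]),
]

for i, t in enumerate(traces):
    print(f"request {i}: TTFT={ttft(t):.2f}s TPOT={tpot(t):.2f}s total={total_latency(t):.2f}s")
print(f"throughput: {throughput(traces):.1f} tokens/s")
```

Note that the second request has a worse TTFT than the first even though both stream at the same TPOT, which is why the two latency components are usually tracked separately.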
Latency vs. throughput is a trade-off
Techniques like batching boost throughput (cheaper per request) but can increase an individual user's latency. The right balance depends on whether you're optimizing for one user's experience or fleet-wide cost.
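The trade-off can be illustrated with a toy cost model. This is a sketch under loud assumptions: each decode step is assumed to pay a fixed cost (`weight_load_ms`) plus a small cost per sequence in the batch (`per_seq_ms`). Both constants are invented for illustration and are not measurements of any real system.

```python
def decode_step_ms(batch_size: int,
                   weight_load_ms: float = 20.0,
                   per_seq_ms: float = 0.5) -> float:
    """Time for one decode step under the assumed toy cost model."""
    return weight_load_ms + per_seq_ms * batch_size


def tokens_per_second(batch_size: int) -> float:
    """Fleet throughput: each step emits one token per sequence in the batch."""
    return batch_size * 1000.0 / decode_step_ms(batch_size)


for b in (1, 8, 64):
    print(f"batch={b:>2}  per-token latency={decode_step_ms(b):5.1f} ms  "
          f"throughput={tokens_per_second(b):7.1f} tokens/s")
```

Under these assumed constants, larger batches raise total throughput substantially while every individual user waits longer for each token.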
Why LLM inference is special
Autoregressive generation has two phases with different bottlenecks:
- Prefill: process the whole input prompt at once (compute-bound, parallelizable).
- Decode: generate output one token at a time, each step depending on the last (memory-bandwidth-bound, hard to parallelize).
The decode phase is why generation feels slow and why memory bandwidth, not just raw compute, often limits performance.
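A back-of-the-envelope sketch shows why the two phases hit different limits. It relies on two common approximations, both assumptions here: a forward pass costs roughly 2 FLOPs per parameter per token, and each decode step for a single stream must read all weights from memory once. The hardware figures (1 TB/s bandwidth, 100 TFLOP/s) and the 7B-parameter model are hypothetical round numbers, not the specs of any particular device.

```python
def decode_tokens_per_s_upper_bound(n_params: float,
                                    bytes_per_param: float,
                                    mem_bandwidth_bytes_per_s: float) -> float:
    """Single-stream decode ceiling: every step streams all weights once."""
    weight_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


def prefill_seconds_lower_bound(n_params: float,
                                prompt_tokens: int,
                                peak_flops: float) -> float:
    """Prefill floor: ~2 FLOPs per parameter per prompt token (approximation)."""
    return 2 * n_params * prompt_tokens / peak_flops


N_PARAMS = 7e9          # hypothetical 7B-parameter model
FP16_BYTES = 2          # bytes per weight at FP16
BANDWIDTH = 1e12        # assumed 1 TB/s memory bandwidth
PEAK_FLOPS = 1e14       # assumed 100 TFLOP/s peak compute

decode_bound = decode_tokens_per_s_upper_bound(N_PARAMS, FP16_BYTES, BANDWIDTH)
prefill_bound = prefill_seconds_lower_bound(N_PARAMS, 1000, PEAK_FLOPS)

print(f"decode ceiling: about {decode_bound:.0f} tokens/s for one stream")
print(f"prefill floor for a 1000-token prompt: about {prefill_bound:.2f} s")
```

With these assumed numbers, the whole 1000-token prompt is processed in a fraction of a second, while decoding is capped at a few dozen tokens per second no matter how much compute is idle. Real systems fall short of both bounds.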
Model-level optimization (change the model)
Make the model itself cheaper to run:
- Quantization: store/compute weights (and sometimes activations) at lower precision (FP16 → INT8 → INT4). Big memory + speed wins for small accuracy cost. The most common, highest-ROI technique.
- Distillation: train a small "student" model to mimic a large "teacher." You get most of the quality at a fraction of the size/cost (connects to data synthesis in Ch 8).
- Pruning: remove redundant weights/structure.
- Architectural tricks: attention optimizations and Mixture-of-Experts (MoE) (only activate part of the model per token) to reduce the compute actually used.
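As an illustration of the quantization bullet above, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one simple scheme among many; production libraries typically use per-channel or per-group scales and more careful calibration. The function names and sample weights are hypothetical.

```python
from typing import List, Tuple


def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print(f"max round-trip error: {max_error:.6f} (at most scale/2 = {scale / 2:.6f})")
for bits in (16, 8, 4):
    print(f"7B params at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

The round-trip error is bounded by half the scale, which is the "small accuracy cost"; the memory figures show the corresponding saving for a hypothetical 7B-parameter model.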
Smaller model, smarter
A distilled + quantized small model finetuned on your task can beat calling a giant general model: it is cheaper, faster, and private. This "make a specialist" pattern recurs throughout the book.
Service-level optimization (change how you serve)
Make the serving system more efficient without changing the model:
- Batching: combine multiple requests to use the hardware fully.
  - Static batching (wait, then run) vs. dynamic/continuous batching (add requests as slots free up). Continuous batching is key for high-throughput LLM serving.
- KV cache: store the attention key/value tensors from previous tokens so you don't recompute them every step. Essential for fast decoding, but it eats memory, so managing it (e.g., with PagedAttention) is a big deal.
- Prompt / prefix caching: reuse computation for shared prompt prefixes (e.g., a long system prompt used by every request).
- Parallelism: split big models/workloads across GPUs: tensor, pipeline, and data parallelism.
- Speculative decoding: a small fast model drafts several tokens; the big model verifies them in one pass, accelerating generation without changing outputs.
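The speculative decoding idea from the list above can be sketched with toy models. This is the greedy variant only, under stated assumptions: `target_next` and `draft_next` are hypothetical stand-in functions rather than real models, and the verification loop simulates what a real system would do in a single batched forward pass of the large model. Sampling-based decoding needs a rejection-sampling rule that is not shown here.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def target_next(ctx: Sequence[int]) -> int:
    """Toy 'large model': a deterministic next-token rule (hypothetical)."""
    return (7 * ctx[-1] + len(ctx)) % 10


def draft_next(ctx: Sequence[int]) -> int:
    """Toy 'small model': agrees with the target except at some positions."""
    guess = target_next(ctx)
    return (guess + 1) % 10 if len(ctx) % 5 == 0 else guess


def greedy_decode(prompt: Sequence[int], next_token: NextToken, num_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(prompt: Sequence[int], target: NextToken, draft: NextToken,
                       num_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns tokens and target verification passes."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1. The draft model proposes k tokens autoregressively.
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target checks every proposed position (one pass in a real system).
        target_passes += 1
        ctx = list(tokens)
        accepted = []
        for t in proposal:
            expected = target(ctx)
            if t != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target(ctx))  # all k accepted: keep one bonus token
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_new], target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(prompt, target_next, 12)
speculative, passes = speculative_decode(prompt, target_next, draft_next, 12)
print("same output as plain greedy decoding:", speculative == baseline)
print(f"target passes: {passes} instead of 12")
```

Because every accepted token is checked against the target model, the output matches plain greedy decoding with the target alone; the saving comes from needing fewer sequential passes of the large model.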
API vs. self-hosting
A core decision the chapter surfaces:
| | Model API (managed) | Self-hosting |
|---|---|---|
| Optimization work | Done for you | Your responsibility |
| Setup speed | Instant | Slower |
| Cost at scale | Pay-per-token (can get pricey) | Potentially cheaper if utilized well |
| Control & privacy | Limited | Full |
| Best for | Getting started, variable load | High scale, custom models, strict privacy |
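The "cost at scale" row comes down to simple arithmetic, sketched below. Every price and the GPU count are hypothetical placeholders, not quotes from any provider, and the sketch ignores engineering time and whether the hardware can actually serve the break-even volume.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def api_cost_per_month(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Pay-per-token cost: scales linearly with usage."""
    return tokens_per_month / 1e6 * usd_per_million_tokens


def self_host_cost_per_month(gpu_count: int, usd_per_gpu_hour: float) -> float:
    """Fixed cost of keeping GPUs running, regardless of traffic."""
    return gpu_count * usd_per_gpu_hour * HOURS_PER_MONTH


def break_even_tokens(gpu_count: int, usd_per_gpu_hour: float,
                      usd_per_million_tokens: float) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    fixed = self_host_cost_per_month(gpu_count, usd_per_gpu_hour)
    return fixed / usd_per_million_tokens * 1e6


# Assumed prices: 2 USD per million tokens via API, 2 USD per GPU-hour self-hosted.
threshold = break_even_tokens(gpu_count=1, usd_per_gpu_hour=2.0, usd_per_million_tokens=2.0)
print(f"self-hosting cost: {self_host_cost_per_month(1, 2.0):.0f} USD/month")
print(f"break-even volume: {threshold / 1e6:.0f} million tokens/month")
```

Below the break-even volume the idle GPU is wasted money, which is the sense in which self-hosting is cheaper only "if utilized well".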
Takeaways
- Inference optimization balances latency vs. throughput vs. cost: you trade among them.
- LLM inference has two phases: prefill (compute-bound) and decode (memory-bandwidth-bound, one token at a time).
- Model level: quantization (highest ROI), distillation, pruning, MoE.
- Service level: continuous batching, KV/prefix caching, parallelism, speculative decoding.
- APIs optimize for you; self-hosting gives control and scale economics but makes optimization your job.