Chapter 9 · Inference Optimization

In one minute

Chapters 5–8 made the model better; this chapter makes it cheaper and faster to run. Inference is the act of generating outputs from a trained model, and at scale it's where most of the cost and latency live. The chapter optimizes at two levels: the model level (quantization, distillation, etc.) and the inference-service level (batching, caching, parallelism). If you use a model API, the provider handles most of this; if you self-host, it's your job.

First: what are we optimizing?

Three metrics in tension; you usually can't max all three:

Latency: how fast the user gets a response.
- Time to first token (TTFT): how long until output starts (matters for streaming/chat).
- Time per output token (TPOT), i.e. per-request generation speed: how fast tokens keep coming once output has started.
Throughput: total tokens/requests served across all users (drives cost-efficiency).
Cost: $ per request/token (compute, memory, hardware).
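To make the metrics concrete, here is a minimal sketch of how they could be computed from token arrival timestamps. The `RequestTrace` type, its field names, and the sample timestamps are illustrative assumptions, not part of any particular serving framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: average gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def fleet_throughput(traces: List[RequestTrace]) -> float:
    """Total output tokens per second across all requests in the window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


# Made-up traces for two overlapping requests.
trace_a = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
trace_b = RequestTrace(sent_at=0.5, token_times=[1.0, 1.5, 2.0])

print("TTFT a:", ttft(trace_a), "s")
print("TPOT a:", tpot(trace_a), "s/token")
print("Fleet throughput:", fleet_throughput([trace_a, trace_b]), "tokens/s")
```

TTFT and TPOT are per-request numbers; throughput is only meaningful across the whole set of requests, which is why the three can pull in different directions.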

Latency vs. throughput is a trade-off

Techniques like batching boost throughput (cheaper per request) but can increase an individual user's latency. The right balance depends on whether you're optimizing for one user's experience or fleet-wide cost.
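A toy cost model can illustrate the trade-off. It assumes each decode step pays a fixed cost (for example, reading the weights) plus a small cost per request in the batch; the function names and the millisecond figures are made up for illustration, not measurements.

```python
def step_time_ms(batch_size: int, fixed_ms: float = 20.0,
                 per_request_ms: float = 1.0) -> float:
    """Assumed time for one decode step that serves the whole batch."""
    return fixed_ms + per_request_ms * batch_size


def per_user_token_latency_ms(batch_size: int) -> float:
    """Each user gets one token per step, so they wait a full step per token."""
    return step_time_ms(batch_size)


def fleet_tokens_per_second(batch_size: int) -> float:
    """Every step yields one token per request in the batch."""
    return batch_size * 1000.0 / step_time_ms(batch_size)


for b in (1, 8, 32):
    print(f"batch={b:2d}  "
          f"latency/token={per_user_token_latency_ms(b):5.1f} ms  "
          f"throughput={fleet_tokens_per_second(b):6.1f} tokens/s")
```

Under these assumptions, larger batches spread the fixed cost over more requests (higher throughput) while every individual user waits longer per token.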

Why LLM inference is special

Autoregressive generation has two phases with different bottlenecks:

Prefill: process the whole input prompt at once (compute-bound, parallelizable).
Decode: generate output one token at a time, each step depending on the last (memory-bandwidth-bound, hard to parallelize).

The decode phase is why generation feels slow and why memory bandwidth, not just raw compute, often limits performance.
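A back-of-the-envelope estimate shows why bandwidth matters. The sketch below assumes a dense model at batch size 1, where every decode step has to read all the weights from memory once, and it ignores KV-cache reads and compute time. The model size and bandwidth figures are hypothetical round numbers.

```python
def min_decode_ms_per_token(num_params: float, bytes_per_param: float,
                            bandwidth_gb_per_s: float) -> float:
    """Lower bound on decode time per token if weight reads are the only cost."""
    weight_bytes = num_params * bytes_per_param
    seconds = weight_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1000.0


# Hypothetical setup: 7B parameters, 1000 GB/s of memory bandwidth.
for label, bytes_per_param in (("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)):
    ms = min_decode_ms_per_token(7e9, bytes_per_param, 1000.0)
    print(f"{label}: at least {ms:.1f} ms/token, at most {1000.0 / ms:.0f} tokens/s")
```

Halving the bytes per weight halves this bound, which is one reason lower-precision weights speed up decoding.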

Model-level optimization (change the model)

Make the model itself cheaper to run:

Quantization: store/compute weights (and sometimes activations) at lower precision (FP16 → INT8 → INT4). Big memory + speed wins for small accuracy cost. The most common, highest-ROI technique.
Distillation: train a small "student" model to mimic a large "teacher." You get most of the quality at a fraction of the size/cost (connects to data synthesis in Ch 8).
Pruning: remove redundant weights/structure.
Architectural tricks: attention optimizations and Mixture-of-Experts (MoE), which activates only part of the model per token, to reduce the compute actually used.
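As an illustration of the quantization idea, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. This is one simple scheme among several; production toolkits typically use finer-grained scales (per channel or per group) and calibration. The helper names and sample weights are assumptions for illustration.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.123, -0.5, 0.337, 1.27, -1.004]  # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight now fits in one byte instead of two (FP16), and the round-trip error per weight is bounded by half the scale, which is the "small accuracy cost" in this simple setting.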

Smaller model, smarter

A distilled + quantized small model finetuned on your task can beat calling a giant general model: cheaper, faster, and private. This "make a specialist" pattern recurs throughout the book.

Service-level optimization (change how you serve)

Make the serving system more efficient without changing the model:

Batching: combine multiple requests to use the hardware fully.
- Static batching (wait, then run) vs. dynamic/continuous batching (add requests as slots free up). Continuous batching is key for high-throughput LLM serving.
KV cache: store the attention key/value tensors from previous tokens so you don't recompute them every step. Essential for fast decoding, but it eats memory, so managing it (e.g., with PagedAttention) is a big deal.
Prompt / prefix caching: reuse computation for shared prompt prefixes (e.g., a long system prompt used by every request).
Parallelism: split big models/workloads across GPUs: tensor, pipeline, and data parallelism.
Speculative decoding: a small fast model drafts several tokens; the big model verifies them in one pass, accelerating generation without changing outputs.
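The draft-then-verify loop of speculative decoding can be sketched as below. This is the greedy variant only, under toy assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the big and small models, and verification is written as a per-position loop where a real system would score all drafted positions in a single forward pass of the big model. Sampling-based versions use a probabilistic accept/reject rule instead of exact matching.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(prompt: List[int], target_next: NextToken, num_new: int) -> List[int]:
    """Baseline: the big model generates one token per step."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt: List[int], target_next: NextToken,
                       draft_next: NextToken, num_new: int,
                       k: int = 4) -> Tuple[List[int], int]:
    """Return (tokens, rounds); each round stands for one big-model verify pass."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # 1. The small model drafts k tokens, building on its own guesses.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The big model checks each drafted position in order.
        accepted: List[int] = []
        for guess in draft:
            correct = target_next(tokens + accepted)
            accepted.append(correct)  # always keep the big model's token
            if guess != correct:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft matched, so the verify pass also yields a bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], rounds


def target_next(tokens: List[int]) -> int:
    """Toy big model: a deterministic rule over digits 0-9."""
    return (tokens[-1] * 3 + 1) % 10


def draft_next(tokens: List[int]) -> int:
    """Toy small model: agrees with the target except after multiples of 3."""
    guess = target_next(tokens)
    return (guess + 1) % 10 if tokens[-1] % 3 == 0 else guess


baseline = greedy_decode([1], target_next, 12)
fast, rounds = speculative_decode([1], target_next, draft_next, 12, k=4)
print("same output:", fast == baseline)
print("verify rounds:", rounds, "vs", 12, "sequential steps")
```

Because every kept token is the big model's own choice, the result matches plain greedy decoding; the speedup comes from needing fewer big-model passes whenever the draft guesses well.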

API vs. self-hosting

A core decision the chapter surfaces:

	Model API (managed)	Self-hosting
Optimization work	Done for you	Your responsibility
Setup speed	Instant	Slower
Cost at scale	Pay-per-token (can get pricey)	Potentially cheaper if utilized well
Control & privacy	Limited	Full
Best for	Getting started, variable load	High scale, custom models, strict privacy
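The "if utilized well" caveat can be made concrete with simple arithmetic. The sketch below compares a per-token API price with the effective per-token cost of a rented GPU. All prices, throughput figures, and function names are hypothetical, and the model ignores engineering time and other operating costs.

```python
def self_host_cost_per_million(gpu_dollars_per_hour: float,
                               tokens_per_second: float,
                               utilization: float) -> float:
    """Effective $ per million tokens for a GPU that is busy `utilization` of the time."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_dollars_per_hour: float,
                           tokens_per_second: float,
                           api_dollars_per_million: float) -> float:
    """Utilization above which self-hosting is cheaper per token than the API."""
    full_load_cost = self_host_cost_per_million(
        gpu_dollars_per_hour, tokens_per_second, 1.0)
    return full_load_cost / api_dollars_per_million


# Made-up numbers: $2/hour GPU, 1000 tokens/s at full load, API at $2 per million tokens.
API_PRICE = 2.0
for utilization in (0.05, 0.5):
    cost = self_host_cost_per_million(2.0, 1000.0, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens "
          f"(API: ${API_PRICE:.2f})")
print(f"break-even utilization: {break_even_utilization(2.0, 1000.0, API_PRICE):.0%}")
```

With these invented figures, a mostly idle GPU costs more per token than the API, while a well-utilized one costs less.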

Takeaways

Inference optimization balances latency vs. throughput vs. cost: you trade among them.
LLM inference has two phases: prefill (compute-bound) and decode (memory-bandwidth-bound, one token at a time).
Model level: quantization (highest ROI), distillation, pruning, MoE.
Service level: continuous batching, KV/prefix caching, parallelism, speculative decoding.
APIs optimize for you; self-hosting gives control and scale economics but makes optimization your job.
