Chapter 9 · Inference Optimization

In one minute

Chapters 5–8 made the model better; this chapter makes it cheaper and faster to run. Inference is the act of generating outputs from a trained model, and at scale it's where most of the cost and latency live. The chapter optimizes at two levels: the model level (quantization, distillation, etc.) and the inference-service level (batching, caching, parallelism). If you use a model API, the provider handles most of this; if you self-host, it's your job.

First: what are we optimizing?

Three metrics are in tension; you usually can't maximize all three:

  • Latency: how fast the user gets a response.
    • Time to first token (TTFT): how long until output starts (matters for streaming/chat).
    • Time per output token (TPOT), i.e. per-request generation speed: how fast tokens keep coming once output has started.
  • Throughput: total tokens/requests served across all users (drives cost-efficiency).
  • Cost: $ per request/token (compute, memory, hardware).
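
The latency and throughput metrics above can be computed from per-token timestamps. The following is a minimal sketch, assuming you can record when each request was sent and when each output token arrived; `RequestTrace` and the sample timings are hypothetical, not something from the book.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (seconds) for one streamed request. Hypothetical structure."""
    sent_at: float            # when the client sent the request
    token_times: List[float]  # arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay before the first output token arrives."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: average gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second across all requests in the observed window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


# Made-up timings for two requests served at the same time.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5]),
    RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0]),
]
```

TTFT and TPOT are measured per request, while throughput is measured across the whole fleet, which is why the two can move in opposite directions.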

Latency vs. throughput is a trade-off

Techniques like batching boost throughput (cheaper per request) but can increase an individual user's latency. The right balance depends on whether you're optimizing for one user's experience or fleet-wide cost.
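
To make the trade-off concrete, here is a toy model of one decode step. It is a sketch under strong assumptions, not a benchmark: a step is assumed to cost whichever is slower, reading the weights once or doing the math for every request in the batch, and both default timings are invented for illustration.

```python
def decode_step_time(
    batch_size: int,
    weight_read_s: float = 0.014,           # hypothetical: time to read the weights once
    compute_per_request_s: float = 0.0005,  # hypothetical: math time per request per step
) -> float:
    """Seconds for one decode step, which yields one token per request."""
    return max(weight_read_s, batch_size * compute_per_request_s)


def tokens_per_second(batch_size: int) -> float:
    """Fleet-wide throughput: one token per request per step."""
    return batch_size / decode_step_time(batch_size)


for batch_size in (1, 8, 28, 64):
    step_ms = decode_step_time(batch_size) * 1000
    print(f"batch={batch_size:3d}  per-token latency={step_ms:5.1f} ms  "
          f"throughput={tokens_per_second(batch_size):7.1f} tok/s")
```

In this toy model, throughput climbs with batch size until the batch's compute time exceeds the weight-read time; past that point throughput plateaus and each user's per-token latency starts to grow.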

Why LLM inference is special

Autoregressive generation has two phases with different bottlenecks:

  • Prefill: process the whole input prompt at once (compute-bound, parallelizable).
  • Decode: generate output one token at a time, each step depending on the last (memory-bandwidth-bound, hard to parallelize).

The decode phase is why generation feels slow and why memory bandwidth, not just raw compute, often limits performance.
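
A back-of-envelope estimate shows why bandwidth matters. The sketch below assumes that every decode step for a single request has to read all the weights from memory once, and it ignores KV cache reads, compute, and kernel overhead, so it is only a rough ceiling. The model size and bandwidth figures are hypothetical.

```python
def decode_tokens_per_second_upper_bound(
    n_params: float,
    bytes_per_param: float,
    memory_bandwidth_bytes_per_s: float,
) -> float:
    """Rough ceiling on single-request decode speed.

    Assumes each step reads every weight once and nothing else costs time.
    """
    model_bytes = n_params * bytes_per_param
    return memory_bandwidth_bytes_per_s / model_bytes


# Hypothetical 7B-parameter model on hypothetical hardware with 1 TB/s of bandwidth.
fp16 = decode_tokens_per_second_upper_bound(7e9, 2.0, 1e12)
int4 = decode_tokens_per_second_upper_bound(7e9, 0.5, 1e12)
print(f"FP16 ceiling: {fp16:.1f} tokens/s")
print(f"INT4 ceiling: {int4:.1f} tokens/s")
```

Under these assumptions, shrinking the bytes per weight raises the ceiling proportionally, which is one reason lower-precision weights speed up decoding.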

Model-level optimization (change the model)

Make the model itself cheaper to run:

  • Quantization: store/compute weights (and sometimes activations) at lower precision (FP16 → INT8 → INT4). Big memory + speed wins for small accuracy cost. The most common, highest-ROI technique.
  • Distillation: train a small "student" model to mimic a large "teacher." You get most of the quality at a fraction of the size/cost (connects to data synthesis in Ch 8).
  • Pruning: remove redundant weights/structure.
  • Architectural tricks: attention optimizations and Mixture-of-Experts (MoE), which activates only part of the model per token, to reduce the compute actually used.
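
As an illustration of the quantization bullet, here is a minimal sketch of symmetric (absmax) INT8 quantization in plain Python. This is one simple scheme chosen for illustration, not necessarily the one the book or any particular library uses, and the weight values are made up.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric (absmax) quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.5, 0.0, 1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is at most half the scale. That is the "small accuracy cost" being traded for memory.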

Smaller model, smarter

A distilled + quantized small model finetuned on your task can beat calling a giant general model: it is cheaper, faster, and private. This "make a specialist" pattern recurs throughout the book.

Service-level optimization (change how you serve)

Make the serving system more efficient without changing the model:

  • Batching: combine multiple requests to use the hardware fully.
    • Static batching (wait, then run) vs. dynamic/continuous batching (add requests as slots free up). Continuous batching is key for high-throughput LLM serving.
  • KV cache: store the attention key/value tensors from previous tokens so you don't recompute them every step. Essential for fast decoding (but it eats memory, so managing it, e.g. with PagedAttention, is a big deal).
  • Prompt / prefix caching: reuse computation for shared prompt prefixes (e.g., a long system prompt used by every request).
  • Parallelism: split big models/workloads across GPUs: tensor, pipeline, and data parallelism.
  • Speculative decoding: a small fast model drafts several tokens; the big model verifies them in one pass, accelerating generation without changing outputs.

API vs. self-hosting

A core decision the chapter surfaces:

                   | Model API (managed)             | Self-hosting
-------------------+---------------------------------+------------------------------------------
Optimization work  | Done for you                    | Your responsibility
Setup speed        | Instant                         | Slower
Cost at scale      | Pay-per-token (can get pricey)  | Potentially cheaper if utilized well
Control & privacy  | Limited                         | Full
Best for           | Getting started, variable load  | High scale, custom models, strict privacy
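
The "cost at scale" comparison comes down to simple arithmetic. The sketch below is a rough break-even estimate with entirely hypothetical prices. It ignores engineering time, and it assumes the rented GPUs can actually serve the volume in question.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Pay-per-token: cost scales linearly with usage."""
    return tokens_per_month / 1e6 * price_per_million_tokens


def self_host_monthly_cost(gpu_count: int, gpu_hourly_rate: float, hours: float = 730.0) -> float:
    """Self-hosting: roughly fixed cost for keeping GPUs running all month."""
    return gpu_count * gpu_hourly_rate * hours


def break_even_tokens(
    price_per_million_tokens: float,
    gpu_count: int,
    gpu_hourly_rate: float,
    hours: float = 730.0,
) -> float:
    """Monthly token volume at which both options cost the same."""
    fixed = self_host_monthly_cost(gpu_count, gpu_hourly_rate, hours)
    return fixed / price_per_million_tokens * 1e6


# Hypothetical numbers: $2 per million tokens vs. 2 GPUs at $2 per hour.
volume = break_even_tokens(price_per_million_tokens=2.0, gpu_count=2, gpu_hourly_rate=2.0)
print(f"Break-even at about {volume / 1e9:.2f} billion tokens per month")
```

Below the break-even volume the API is cheaper; above it, self-hosting wins only if the hardware stays well utilized.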

Takeaways

  • Inference optimization balances latency vs. throughput vs. cost: you trade among them.
  • LLM inference has two phases: prefill (compute-bound) and decode (memory-bound, one token at a time).
  • Model level: quantization (highest ROI), distillation, pruning, MoE.
  • Service level: continuous batching, KV/prefix caching, parallelism, speculative decoding.
  • APIs optimize for you; self-hosting gives control and scale economics but makes optimization your job.

My personal learning notes from "AI Engineering" by Chip Huyen (O'Reilly, 2025). Shared for learning purposes; please buy the book.