
Chapter 8 · Dataset Engineering

In one minute

Finetuning frameworks are easy; getting good data is the hard part. This chapter is all about data: how to figure out what data you need, how to acquire it, how to synthesize it (using AI to generate training data), and how to process it (clean, dedupe, format). It also tackles the central question: what does "data quality" even mean, and how do you measure it? These ideas apply well beyond finetuning.

Data is the real bottleneck

Across the book, the recurring lesson: model quality is downstream of data quality. You can swap models and tune hyperparameters all day, but if your data is wrong, noisy, or unrepresentative, results won't improve. So treat the dataset as a first-class engineered artifact.

What data do you actually need?

Before collecting anything, define:

  • The behavior/skill you're teaching (decides what examples look like).
  • Data format: instruction→response pairs, preference pairs (chosen vs. rejected), raw domain text for continued pre-training, etc.
  • Coverage: the range of inputs, edge cases, and difficulty your app will face.
  • How much: depends on the task and method (PEFT often needs far less than full finetuning).
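
To make the "data format" point concrete, here is a minimal sketch of what an instruction→response record and a preference record might look like when stored as JSON Lines. The field names (instruction, input, response, prompt, chosen, rejected) and the helper functions are illustrative assumptions, not a standard from the book; check the template your training framework actually expects.

```python
import json

# Hypothetical record shapes; field names are assumptions for illustration.
instruction_example = {
    "instruction": "Summarize the ticket in one sentence.",
    "input": "Customer reports the export button does nothing on Safari.",
    "response": "The export button is unresponsive in Safari.",
}

preference_example = {
    "prompt": "Explain what a LoRA adapter is.",
    "chosen": "A LoRA adapter is a small set of low-rank matrices "
              "trained on top of frozen weights.",
    "rejected": "LoRA is a kind of GPU.",
}


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def missing_fields(record, required):
    """Return the required fields that are absent or blank in a record."""
    return [f for f in required if not str(record.get(f, "")).strip()]
```

A schema check like missing_fields is cheap to run over the whole dataset before training, and catches half-filled records early.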

Quality and coverage beat raw volume

A few thousand clean, diverse, correct examples usually beat a giant pile of noisy ones, especially with PEFT/LoRA. Start small, evaluate, and grow the dataset where the model is weak.
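
One way to find "where the model is weak" is to slice eval results by category and rank the slices by accuracy. The sketch below assumes each eval result is a (category, is_correct) pair; both that shape and the function names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """results: iterable of (category, is_correct) pairs from an eval run."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}


def weakest_categories(results, k=2):
    """Return the k categories with the lowest accuracy (ties broken by name)."""
    acc = accuracy_by_category(results)
    return sorted(acc, key=lambda c: (acc[c], c))[:k]
```

The weakest categories are candidates for the next round of data collection. Small slices give noisy accuracy estimates, so treat the ranking as a hint rather than a verdict.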

The three dimensions of a dataset

The book frames dataset design around three properties to balance:

  1. Quality: correct, relevant, well-formatted, low-noise examples.
  2. Coverage / diversity: spans the real distribution of inputs and edge cases.
  3. Quantity: enough examples for the method and task.

Data acquisition

Where data comes from, roughly in order of preference when available:

  • Your own application/production data: the most relevant (logs, user interactions, with consent/compliance).
  • Human annotation: accurate but slow and expensive; needs clear guidelines and quality control.
  • Public/open datasets: fast to start, but check license, quality, and relevance.
  • Augmentation: transform existing data to create more variety.

Always mind privacy, consent, licensing, and compliance: data provenance matters legally and ethically.

Data synthesis (AI making data for AI)

A major, modern topic: use models to generate training data.

Why it's attractive

  • Scale & speed: generate large datasets cheaply.
  • Cover rare cases: manufacture hard or underrepresented examples on demand.
  • Privacy: synthetic data can avoid exposing real user data.
  • Distillation: use a strong "teacher" model to generate data that trains a smaller "student" model (a key way small models get good; see Ch 9).

The dangers

  • Quality & correctness: generated data can be subtly wrong; it needs verification.
  • Bias amplification: the generator's flaws get baked into the student.
  • Model collapse: training repeatedly on AI-generated data can degrade quality over generations.
  • Legal/ToS issues: using one provider's model outputs to train a competing model may violate that provider's terms of service.

Verify synthetic data

Synthetic data is powerful, but quality does not come for free. Filter and verify it (rules, AI judges, functional checks, human spot-checks) before training on it.
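
As a rough sketch of the "rules" and "functional checks" layers, the filter below rejects responses that fail simple heuristics and, for examples that carry a machine-checkable answer, recomputes that answer. The record fields (response, operands, answer), the thresholds, and the banned phrases are all assumptions chosen for illustration; AI judges and human spot-checks would sit on top of this.

```python
def passes_rules(example, min_chars=10, max_chars=2000):
    """Cheap heuristic checks on the generated response."""
    response = example.get("response", "").strip()
    if not (min_chars <= len(response) <= max_chars):
        return False
    # Generators sometimes leak refusals or meta-text into outputs.
    banned = ("as an ai language model", "i cannot help with")
    return not any(phrase in response.lower() for phrase in banned)


def passes_functional_check(example):
    """Recompute the answer when the example is machine-checkable."""
    if "operands" not in example:
        return True  # nothing to verify functionally
    a, b = example["operands"]
    return example["answer"] == a + b


def filter_synthetic(examples):
    """Split examples into (kept, rejected) so rejects can be inspected."""
    kept, rejected = [], []
    for ex in examples:
        ok = passes_rules(ex) and passes_functional_check(ex)
        (kept if ok else rejected).append(ex)
    return kept, rejected
```

Keeping the rejected examples, rather than silently dropping them, makes it easier to spot systematic generator failures during manual review.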

Data processing

Turning raw data into a clean training set:

  • Clean: remove corrupt, irrelevant, or low-quality samples.
  • Deduplicate: duplicates waste compute, skew the model, and cause leakage between train/eval. (Critical.)
  • Filter: by quality scores, language, length, safety.
  • Decontaminate: remove any overlap with your evaluation set so scores stay honest.
  • Format: convert to the exact template the training method expects (chat format, special tokens, etc.).
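
The deduplicate and decontaminate steps above can be sketched with the standard library alone. This version assumes exact-match deduplication on normalized text and decontamination by shared word n-grams with the eval set; the normalization rules and the default n=5 are illustrative choices, and near-duplicate detection (for example MinHash) would need a different technique.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def deduplicate(texts):
    """Keep the first occurrence of each text, comparing normalized forms."""
    seen, unique = set(), []
    for t in texts:
        key = hashlib.sha256(normalize(t).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def ngrams(text, n):
    """Set of word n-grams; texts shorter than n words yield an empty set."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_texts, eval_texts, n=5):
    """Drop training texts that share any word n-gram with the eval set."""
    eval_grams = set()
    for t in eval_texts:
        eval_grams |= ngrams(t, n)
    return [t for t in train_texts if not (ngrams(t, n) & eval_grams)]
```

Note that very short texts (fewer than n words) are never flagged by the n-gram check, so they may need a separate exact-match comparison against the eval set.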

How to evaluate data quality

You can (and should) measure data, not just trust it:

  • Manual inspection: actually read samples (underrated, high-value).
  • Heuristics & statistics: length distributions, duplication rate, label balance, perplexity outliers.
  • AI judges / classifiers: score examples for quality or relevance at scale.
  • Train-and-check: the ultimate test. Does training on it actually improve your eval metrics?
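
The "heuristics & statistics" item can start as a one-function report. The sketch below assumes each example is a dict with text and label keys (a hypothetical schema) and a non-empty dataset. It measures length in whitespace-separated words and counts duplicates by case-insensitive exact match.

```python
from collections import Counter
import statistics


def dataset_report(examples):
    """Summarize length, duplication, and label balance for a dataset."""
    n = len(examples)
    lengths = [len(ex["text"].split()) for ex in examples]
    texts = [ex["text"].strip().lower() for ex in examples]
    label_counts = Counter(ex["label"] for ex in examples)
    return {
        "count": n,
        "median_length": statistics.median(lengths),
        "max_length": max(lengths),
        # Share of examples that repeat an earlier text.
        "duplicate_rate": 1 - len(set(texts)) / n,
        "label_share": {label: c / n for label, c in label_counts.items()},
    }
```

Numbers like these do not prove the data is good, but a sudden spike in duplicate rate or a badly skewed label share is a cheap early warning before any training run.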

Takeaways

  • Data is the bottleneck. Engineer your dataset as deliberately as your code.
  • Balance the three dimensions (quality, coverage, quantity), and favor quality/coverage over sheer size.
  • Acquire from your own data first; supplement with annotation, public data, and augmentation, minding licensing & privacy.
  • Synthetic data + distillation is powerful for scale and small-model training, but must be verified (watch for bias and model collapse).
  • Deduplicate and decontaminate religiously, and measure data quality instead of assuming it.

My personal learning notes from "AI Engineering" by Chip Huyen (O'Reilly, 2025). Shared for learning purposes; please buy the book.