Glossary

Quick, plain-English definitions of the key terms used throughout AI Engineering.

Core concepts

Foundation model: A large model trained on massive data that can be adapted to many different tasks. The "foundation" you build applications on top of.

Large language model (LLM): A foundation model specialized in text.

Large multimodal model (LMM): A foundation model that handles multiple data types (text, images, audio, etc.).

AI engineering: Building applications on top of readily available foundation models (vs. training models from scratch).

Token: The basic unit a model reads/writes: a character, word, or word-piece. ~100 tokens ≈ 75 English words.

Tokenization: Splitting text into tokens.

Vocabulary: The full set of tokens a model can use.

Context window: The maximum number of tokens a model can consider at once. For most models this limit covers the prompt plus the generated output.
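The rule of thumb in the Token entry (~100 tokens ≈ 75 English words) can be turned into a quick budget check against a context window. This is a rough sketch only: the function names are illustrative, the word count is a plain whitespace split rather than a real tokenizer, and `reserved_for_output` assumes the window is shared between the prompt and the generated output, which varies by model.

```python
# Rule of thumb from the Token entry: ~100 tokens per 75 English words.
TOKENS_PER_75_WORDS = 100

def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return round(words * TOKENS_PER_75_WORDS / 75)

def fits_in_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the estimated prompt tokens plus reserved output tokens fit the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window

prompt = "word " * 750                      # a 750-word prompt
print(estimate_tokens(prompt))              # 1000
print(fits_in_context(prompt, 8000, 500))   # True
```

For real budgeting, count tokens with the tokenizer of the model you are actually calling; the ratio differs across tokenizers and languages.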

How models work

Self-supervision: Learning from unlabeled data by predicting parts of it (e.g., the next token). This removed the labeling bottleneck and made foundation models possible.

Autoregressive model: Predicts the next token from previous tokens; powers generative AI.

Masked language model: Fills in blanks using context from both sides (e.g., BERT); good for understanding/classification.

Transformer: The dominant model architecture, built on the attention mechanism.

Attention: A mechanism that lets the model weigh which earlier tokens matter most when producing the next one.

Parameters / weights: The learned numbers that define a model; "size" is often measured in parameter count.

Scaling laws: Empirical rules describing how performance improves with more parameters, data, and compute.

Pre-training: The initial, large-scale self-supervised training that creates a base model.

Post-training: Steps after pre-training that make a model helpful and aligned (SFT + preference tuning).

SFT (Supervised Finetuning): Training on curated input→output examples to teach instruction-following and style.

RLHF (Reinforcement Learning from Human Feedback): Aligning a model to human preferences via a reward model + RL.

DPO (Direct Preference Optimization): A simpler alternative to RLHF for preference tuning that trains directly on preference pairs, with no separate reward model or RL loop.
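To make the Attention entry concrete, here is a minimal sketch of scaled dot-product attention for a single query over a few earlier tokens, in plain Python. The vectors are hand-made toy values, and the function names are illustrative; a real transformer learns the query, key, and value projections and runs many attention heads in parallel.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over earlier tokens."""
    d = len(query)
    # Similarity of the query to each earlier token's key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy example: the query points in the same direction as the first key.
weights, output = attend(
    query=[4.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights)  # the first token receives most of the weight
print(output)
```

The weights are what "weighing which earlier tokens matter most" means in practice: tokens whose keys align with the query contribute more to the output.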

Generation

Inference: Using a trained model to generate outputs.

Sampling: Picking the next token from the model's probability distribution.

Temperature: Controls sampling randomness by scaling the logits before softmax. Higher = more varied/creative; lower = more focused/deterministic.

Top-k / Top-p (nucleus): Restrict sampling to the most likely tokens (by count, or by cumulative probability).

Hallucination: A confident but false or unsupported output.
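The Sampling, Temperature, and Top-k / Top-p entries above can be sketched together in a few lines. This is an illustrative toy, not any particular library's implementation: the logits are made up, the helper names are hypothetical, and temperature is assumed to be strictly greater than zero.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out tokens not in `keep`, then rescale so the rest sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def filter_top_k(probs, k):
    """Top-k: keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def filter_top_p(probs, p):
    """Top-p (nucleus): keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng=random):
    """Pick one token index according to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]          # made-up scores for a 4-token vocabulary
rng = random.Random(0)                   # seeded only to make the demo repeatable
probs = softmax(logits, temperature=0.7)
print(sample(filter_top_p(probs, 0.9), rng))
```

Lower temperatures concentrate probability on the top token, while top-k and top-p remove the unlikely tail before the random draw.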

Evaluation

Cross-entropy: How "surprised" a model is by the true next token (lower = better predictions).

Perplexity: Exponential of cross-entropy; roughly, how many options the model is choosing between.

BLEU / ROUGE: Lexical (word-overlap) similarity metrics.

Semantic similarity: Comparing meaning via embeddings.

Embedding: A vector (list of numbers) representing the meaning of text/data.

AI as a judge (LLM-as-a-judge): Using a strong model to grade outputs; scalable, but the judge can be biased and inconsistent.

Functional correctness: Checking output by running it (e.g., does the generated code pass tests).

Benchmark: A standardized test set (e.g., MMLU) for comparing models.

Data contamination: When test data leaks into training data, inflating benchmark scores.
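The relationship between the Cross-entropy and Perplexity entries above is easy to check numerically. A minimal sketch, assuming we already have the probability the model assigned to each true next token (the lists below are made up) and using natural logarithms:

```python
import math

def cross_entropy(true_token_probs):
    """Average negative log-probability (in nats) assigned to the true next tokens."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

def perplexity(true_token_probs):
    """Exponential of cross-entropy."""
    return math.exp(cross_entropy(true_token_probs))

# A model that always spreads probability evenly over 4 options
# has a perplexity of about 4: it is "choosing between" 4 tokens.
uniform_over_4 = [0.25, 0.25, 0.25]
print(perplexity(uniform_over_4))

# A more confident (and correct) model is less surprised, so both numbers drop.
confident = [0.9, 0.8, 0.95]
print(cross_entropy(confident), perplexity(confident))
```

If the logarithm base changes (e.g., bits instead of nats), the exponentiation base must change with it for the perplexity to stay the same.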

Improving quality

Prompt engineering: Improving output by changing the instructions/context, not the model.

System / user prompt: The role-setting instructions vs. the actual request.

In-context learning: Adapting behavior from instructions/examples in the prompt, without retraining.

Zero-shot / few-shot: Prompting with no examples / with a few examples.

Chain-of-thought (CoT): Asking the model to reason step by step to improve accuracy.

Prompt injection: Malicious instructions hidden in user input or external data.

Jailbreaking: Tricking a model into bypassing its safety rules.

RAG (Retrieval-Augmented Generation): Fetching relevant information and adding it to the prompt to ground answers.

Retriever: The component that finds relevant documents for RAG.

Vector database: A store for embeddings that supports similarity search.

Reranking: Re-scoring retrieved candidates to surface the best ones.

Hybrid search: Combining keyword (sparse) and embedding (dense) retrieval.

Agent: A model that takes actions via tools, often over multiple steps, to achieve a goal.

Tool: A capability an agent can call (search, code execution, APIs, etc.).

Memory (short/long-term): Short-term memory holds conversation/session state; long-term memory is persisted knowledge retrieved when relevant.
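The RAG, Retriever, and Vector database entries above fit together as a small pipeline: embed, find the nearest stored items, and add them to the prompt. The sketch below is a toy under loud assumptions: the 3-dimensional "embeddings" are hand-made (a real system would get them from an embedding model), the in-memory list stands in for a vector database, and all names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by angle: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-in for a vector database: (text, hand-made embedding) pairs.
STORE = [
    ("LoRA trains small adapter matrices.", [0.9, 0.1, 0.0]),
    ("The KV cache avoids recomputing attention keys and values.", [0.1, 0.9, 0.1]),
    ("BLEU measures word overlap.", [0.0, 0.2, 0.9]),
]

def retrieve(query_embedding, k=1):
    """Retriever: return the k stored texts most similar to the query."""
    ranked = sorted(
        STORE,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, query_embedding, k=1):
    """RAG step: put the retrieved text into the prompt to ground the answer."""
    context = "\n".join(retrieve(query_embedding, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# The query embedding is also hand-made, chosen to sit near the LoRA entry.
print(build_prompt("What does LoRA train?", [0.8, 0.2, 0.0]))
```

Reranking and hybrid search would slot in between `retrieve` and `build_prompt`: retrieve more candidates than needed, then re-score or merge them before choosing what goes into the prompt.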

Finetuning & data

Finetuning: Further training a model on your data to change its behavior/skills.

PEFT (Parameter-Efficient Finetuning): Finetuning only a small fraction of parameters to save memory/cost.

LoRA (Low-Rank Adaptation): Popular PEFT method: train small adapter matrices while freezing base weights.

QLoRA: LoRA applied on top of a quantized base model.

Model merging: Combining multiple models/adapters into one (experimental).

Distillation: Training a small "student" model to imitate a large "teacher."

Data synthesis: Using AI to generate training data.

Model collapse: Quality degradation from repeatedly training on AI-generated data.

Deduplication / decontamination: Removing duplicate data / removing eval-set overlap from training data.
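The PEFT and LoRA entries above come down to simple arithmetic: instead of updating a full weight matrix, LoRA trains two thin matrices whose product is added to the frozen weights. The sketch below assumes the common formulation (frozen `W`, trainable `B` and `A`, with `B` starting at zero so training begins from the base model's behavior); the names, layer sizes, and `scale` parameter are illustrative.

```python
def full_params(d_out, d_in):
    """Parameters in a full weight matrix W (d_out x d_in)."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the adapters B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def lora_forward(W, A, B, x, scale=1.0):
    """Output of the frozen layer plus the low-rank update: W x + scale * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

# Parameter savings for one hypothetical 4096 x 4096 layer at rank 8.
d_out, d_in, rank = 4096, 4096, 8
ratio = lora_trainable_params(d_out, d_in, rank) / full_params(d_out, d_in)
print(f"trainable fraction: {ratio:.4%}")

# Tiny forward pass: with B all zeros, the adapter changes nothing.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]            # rank 1: shape 1 x 2
B_zero = [[0.0], [0.0]]     # shape 2 x 1
print(lora_forward(W, A, B_zero, [1.0, 1.0]))
```

QLoRA keeps the same adapter idea but stores the frozen `W` in quantized form, which is where the extra memory savings come from.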

Production

Quantization: Storing/computing in lower numerical precision (FP16/INT8/INT4) to save memory and speed up inference.

Pruning: Removing redundant weights/structure from a model.

Mixture-of-Experts (MoE): Activating only part of a model per token to reduce compute.

Latency / throughput: How long one request takes vs. how many requests (or tokens) the system serves per unit of time.

TTFT (Time To First Token): How long until output begins.

Prefill / decode: The input-processing phase vs. the token-by-token generation phase.

Batching: Grouping requests to use hardware efficiently (continuous batching for LLMs).

KV cache: Storing previous tokens' attention keys/values to avoid recomputation.

Speculative decoding: A small model drafts tokens the big model verifies, speeding generation.

Guardrails: Input/output filters that block unsafe or invalid content.

Model router / gateway: A router sends each query to the most suitable model; a gateway is a unified, managed entry point to multiple models.

Observability: Logging and monitoring to understand and debug a live system.
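As an illustration of the Quantization entry above, here is a minimal sketch of one simple scheme: symmetric "absmax" quantization to INT8, where the largest absolute weight is mapped to 127. This is only one of several possible schemes, the weights are made up, and the function names are illustrative; production libraries typically quantize per channel or per block and handle outliers more carefully.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0   # avoid dividing by zero
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)          # small integers: 1 byte each instead of 4 bytes for FP32
print(restored)   # close to the originals, but not exact
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

The rounding error per weight is bounded by half the scale, which is the trade-off the entry describes: less memory and faster arithmetic in exchange for a small loss of precision.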

My personal learning notes from "AI Engineering" by Chip Huyen (O'Reilly, 2025). Shared for learning purposes; please buy the book.