Glossary
Quick, plain-English definitions of the key terms used throughout AI Engineering.
Core concepts
Foundation model: A large model trained on massive data that can be adapted to many different tasks. The "foundation" you build applications on top of.
Large language model (LLM): A foundation model specialized in text.
Large multimodal model (LMM): A foundation model that handles multiple data types (text, images, audio, etc.).
AI engineering: Building applications on top of readily available foundation models (vs. training models from scratch).
Token: The basic unit a model reads/writes: a character, word, or word-piece. As a rule of thumb, 100 tokens ≈ 75 English words, though the ratio depends on the tokenizer.
Tokenization: Splitting text into tokens.
Vocabulary: The full set of tokens a model can use.
Context window: The maximum number of tokens a model can consider at once, covering the prompt and any output generated so far.
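To make the token, tokenization, vocabulary, and context window entries above concrete, here is a minimal sketch. It assumes a toy word-and-punctuation tokenizer; production models use learned subword tokenizers instead. All function names and the `<unk>` placeholder token are hypothetical, chosen for illustration only.

```python
import re

def tokenize(text: str) -> list[str]:
    """Toy tokenizer (assumption): lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocabulary(corpus: list[str]) -> dict[str, int]:
    """Map every distinct token in the corpus to an integer ID.
    ID 0 is reserved for tokens that are not in the vocabulary."""
    vocab: dict[str, int] = {"<unk>": 0}
    for text in corpus:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text to token IDs, falling back to <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]

def fits_context_window(token_ids: list[int], context_window: int) -> bool:
    """Check whether a token sequence fits within the model's limit."""
    return len(token_ids) <= context_window

def estimate_words(num_tokens: int) -> int:
    """Rule of thumb from the Token entry: 100 tokens is about 75 English words."""
    return round(num_tokens * 0.75)

vocab = build_vocabulary(["the cat sat", "the dog sat"])
ids = encode("the bird sat", vocab)  # "bird" is unseen, so it maps to <unk>
```

Here `ids` can be passed to `fits_context_window` with whatever limit the target model documents.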
How models work
Self-supervision: Learning from unlabeled data by predicting parts of it (e.g., the next token). This removed the labeling bottleneck and made foundation models possible.
Autoregressive model: Predicts the next token from previous tokens; powers generative AI.
Masked language model: Fills in blanks using context from both sides (e.g., BERT); good for understanding/classification.
Transformer: The dominant model architecture, built on the attention mechanism.
Attention: A mechanism that lets the model weigh which other tokens in the context matter most when processing each token (in autoregressive models, only the earlier tokens).
Parameters / weights: The learned numbers that define a model; "size" is often measured in parameter count.
Scaling laws: Empirical rules describing how performance improves with more parameters, data, and compute.
Pre-training: The initial, large-scale self-supervised training that creates a base model.
Post-training: Steps after pre-training that make a model helpful and aligned (SFT + preference tuning).
SFT (Supervised Finetuning): Training on curated input→output examples to teach instruction-following and style.
RLHF (Reinforcement Learning from Human Feedback): Aligning a model to human preferences via a reward model + RL.
DPO (Direct Preference Optimization): A simpler alternative to RLHF for preference tuning.
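The Attention and Autoregressive model entries above can be illustrated with a small sketch of scaled dot-product attention with a causal mask. This is a simplification under stated assumptions: a single attention head, no learned projection matrices, and plain Python lists instead of tensors. The function names are hypothetical.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """For each position i, mix the values of positions 0..i.

    The weight on position t is softmax(q_i . k_t / sqrt(d)), so tokens whose
    keys match the current query contribute more. Later positions are never
    visited, which is the causal (autoregressive) restriction.
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(qj * kj for qj, kj in zip(q, keys[t])) / math.sqrt(d)
            for t in range(i + 1)
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * values[t][c] for t, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Two tokens with 2-dimensional vectors (made-up numbers for illustration).
q = k = v = [[1, 0], [0, 1]]
out = causal_attention(q, k, v)
```

The first position can only attend to itself, so its output equals its own value vector; the second position blends both values, weighted toward the key that best matches its query.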
Generation
Inference: Using a trained model to generate outputs.
Sampling: Picking the next token from the model's probability distribution.
Temperature: Controls randomness: higher = more creative, lower = more focused/deterministic.
Top-k / Top-p (nucleus): Restrict sampling to the most likely tokens (by count, or by cumulative probability).
Hallucination: A confident but false or unsupported output.
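The Sampling, Temperature, and Top-k / Top-p entries above can be sketched as follows. This is an illustrative sketch, not any particular library's implementation; the function names and the example logits are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalize(probs: list[float], keep: list[int]) -> dict[int, float]:
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_k_filter(probs: list[float], k: int) -> dict[int, float]:
    """Keep only the k most likely tokens (restriction by count)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])

def top_p_filter(probs: list[float], p: float) -> dict[int, float]:
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches p (restriction by cumulative probability)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

def sample(dist: dict[int, float], rng: random.Random) -> int:
    """Pick one token ID according to the (filtered) distribution."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
cold = softmax_with_temperature(logits, 0.5)
base = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
next_token = sample(top_p_filter(base, 0.9), random.Random(0))
```

With these logits, the most likely token gets more probability mass at temperature 0.5 than at 1.0, and less at 2.0, which is the "focused vs. creative" trade-off the Temperature entry describes.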
Evaluation
Cross-entropy: How "surprised" a model is by the true next token (lower = better predictions).
Perplexity: Exponential of cross-entropy; roughly, how many options the model is choosing between.
BLEU / ROUGE: Lexical (word-overlap) similarity metrics.
Semantic similarity: Comparing meaning via embeddings.
Embedding: A vector (list of numbers) representing the meaning of text/data.
AI as a judge (LLM-as-a-judge): Using a strong model to grade outputs; scalable, but prone to bias and inconsistent scores.
Functional correctness: Checking output by running it (e.g., does the generated code pass tests).
Benchmark: A standardized test set (e.g., MMLU) for comparing models.
Data contamination: When test data leaks into training data, inflating benchmark scores.
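The Cross-entropy, Perplexity, and Semantic similarity entries above reduce to short formulas. The sketch below assumes you already have the probability the model assigned to each true next token, and that embeddings are plain lists of floats; the function names are hypothetical.

```python
import math

def cross_entropy(true_token_probs: list[float]) -> float:
    """Average negative log-probability (in nats) of the true next tokens.
    Lower means the model was less surprised."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

def perplexity(true_token_probs: list[float]) -> float:
    """Exponential of cross-entropy."""
    return math.exp(cross_entropy(true_token_probs))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compare two embedding vectors by the angle between them."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A model that always gives the true token probability 0.25 behaves as if
# it were choosing uniformly among 4 options.
uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

This matches the intuition in the Perplexity entry: a perplexity of about 4 corresponds to the model being as uncertain as a uniform choice among 4 tokens.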
Improving quality
Prompt engineering: Improving output by changing the instructions/context, not the model.
System / user prompt: The role-setting instructions vs. the actual request.
In-context learning: Adapting behavior from instructions/examples in the prompt, without retraining.
Zero-shot / few-shot: Prompting with no examples / with a few examples.
Chain-of-thought (CoT): Asking the model to reason step by step to improve accuracy.
Prompt injection: An attack in which malicious instructions hidden in user input or external data try to override the application's intended instructions.
Jailbreaking: Tricking a model into bypassing its safety rules.
RAG (Retrieval-Augmented Generation): Fetching relevant information and adding it to the prompt to ground answers.
Retriever: The component that finds relevant documents for RAG.
Vector database: A store for embeddings that supports similarity search.
Reranking: Re-scoring retrieved candidates to surface the best ones.
Hybrid search: Combining keyword (sparse) and embedding (dense) retrieval.
Agent: A model that takes actions via tools, often over multiple steps, to achieve a goal.
Tool: A capability an agent can call (search, code execution, APIs, etc.).
Memory (short/long-term): Short-term memory holds the current conversation/session state; long-term memory persists knowledge and retrieves it when relevant.
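The RAG, Retriever, and Embedding entries above fit together roughly as in this sketch. It is a toy under a loud assumption: the "embedding" is a bag-of-words count vector standing in for a learned embedding model, and an in-memory list stands in for a vector database. The function names and prompt wording are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding (assumption): word counts, not a learned model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], top_n: int = 1) -> list[str]:
    """Retriever: rank documents by similarity to the query, keep the best."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_n]

def build_rag_prompt(query: str, documents: list[str], top_n: int = 1) -> str:
    """RAG: add the retrieved text to the prompt so the answer is grounded in it."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_n))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "The KV cache stores attention keys and values.",
    "LoRA trains small adapter matrices.",
    "Perplexity is the exponential of cross-entropy.",
]
prompt = build_rag_prompt("What does the KV cache store?", docs)
```

The resulting `prompt` string is what would be sent to the model; a reranking step, if used, would sit between `retrieve` and prompt assembly.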
Finetuning & data
Finetuning: Further training a model on your data to change its behavior/skills.
PEFT (Parameter-Efficient Finetuning): Finetuning only a small fraction of parameters to save memory/cost.
LoRA (Low-Rank Adaptation): Popular PEFT method: train small adapter matrices while freezing base weights.
QLoRA: LoRA applied on top of a quantized base model.
Model merging: Combining multiple models/adapters into one (experimental).
Distillation: Training a small "student" model to imitate a large "teacher."
Data synthesis: Using AI to generate training data.
Model collapse: Quality degradation from repeatedly training on AI-generated data.
Deduplication / decontamination: Removing duplicate data / removing eval-set overlap from training data.
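The PEFT and LoRA entries above can be illustrated with a little arithmetic. The sketch assumes the common LoRA formulation, where a frozen weight matrix W is adjusted by a low-rank product scaled by alpha / rank; the function names and the tiny matrices are hypothetical.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (parameters in the full weight matrix, trainable LoRA parameters).
    LoRA trains B (d_out x rank) and A (rank x d_in) instead of W (d_out x d_in)."""
    return d_out * d_in, rank * (d_out + d_in)

def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain-Python matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, b, a, alpha: float, rank: int):
    """Effective weight W + (alpha / rank) * B @ A. W itself is never modified."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

full, trainable = lora_param_counts(4096, 4096, rank=8)
# full == 16_777_216, trainable == 65_536 (about 0.4% of the full matrix)

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (made-up values)
b = [[1.0], [2.0]]             # 2 x 1 adapter matrix
a = [[3.0, 4.0]]               # 1 x 2 adapter matrix
merged = lora_effective_weight(w, b, a, alpha=1.0, rank=1)
```

If B is all zeros, which is a typical starting point, the effective weight equals the base weight, so finetuning begins from the unmodified model.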
Production
Quantization: Storing/computing in lower numerical precision (FP16/INT8/INT4) to save memory and speed up inference.
Pruning: Removing redundant weights/structure from a model.
Mixture-of-Experts (MoE): Activating only part of a model per token to reduce compute.
Latency / throughput: How long one request takes vs. how many requests (or tokens) the system serves per unit of time.
TTFT (Time To First Token): How long until output begins.
Prefill / decode: The input-processing phase vs. the token-by-token generation phase.
Batching: Grouping requests to use hardware efficiently. LLM servers often use continuous batching, which adds and removes requests from the running batch as they arrive and finish.
KV cache: Storing previous tokens' attention keys/values to avoid recomputation.
Speculative decoding: A small model drafts tokens the big model verifies, speeding generation.
Guardrails: Input/output filters that block unsafe or invalid content.
Model router / gateway: A router sends each query to the most suitable model; a gateway is a unified, managed entry point to multiple models.
Observability: Logging and monitoring to understand and debug a live system.
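The Quantization entry above can be made concrete with a round trip through INT8. This sketch assumes symmetric "absmax" quantization with one scale per list of weights, which is only one of several schemes in use; the function names and sample weights are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]  # made-up values for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now needs 1 byte instead of 4 (FP32) or 2 (FP16),
# at the cost of a small rounding error per value.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is why quantization saves memory with only a modest loss of precision when the weights are well spread across the range.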