Chapter 6 · RAG and Agents
In one minute
A model can only answer well if it has the right context. This chapter covers the two big patterns for giving models the information and capabilities they lack:
- RAG (Retrieval-Augmented Generation): fetch relevant information and put it in the prompt. Mature, reliable, widely used in production.
- Agents: let the model take actions using tools (search, code, APIs) in a loop. Far more powerful, but more complex and still maturing.
Both exist because of two hard limits: a model's knowledge is frozen at training time, and its context window is finite.
Why context matters
- Models don't know your private data, recent events, or your user's specifics.
- The context window (how many tokens fit in one prompt) is limited, so you can't just paste everything.
- Solution: put the right information in front of the model at the right time. That's what RAG and agents do.
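A minimal sketch of why "paste everything" fails: context has to be packed under a budget. The function name `pack_context` and the word-count approximation of tokens are assumptions for illustration; a real system would count tokens with the model's own tokenizer.

```python
def pack_context(snippets, max_tokens):
    """Keep snippets, in the order given, until the token budget is spent.

    Token counts are approximated by whitespace-separated words
    (an assumption for this sketch, not how a real tokenizer counts).
    """
    kept, used = [], 0
    for text in snippets:
        cost = len(text.split())
        if used + cost > max_tokens:
            break  # stop at the first snippet that would overflow the budget
        kept.append(text)
        used += cost
    return kept, used


snippets = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Our office is closed on public holidays.",
]
kept, used = pack_context(snippets, max_tokens=16)
print(len(kept), "snippets kept,", used, "approximate tokens used")
```

Because packing stops at the first overflow, the order of the snippets decides what the model sees, which is why ranking the most relevant material first matters.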
Part 1: RAG
The core idea
Instead of relying on what the model memorized, retrieve relevant documents from an external knowledge source and add them to the prompt. The model then answers grounded in that context.
User question
│
▼
[Retriever] ──fetches──> top-k relevant chunks from your knowledge base
│
▼
Prompt = question + retrieved chunks
│
▼
[Model] ──> grounded answer (ideally citing the chunks)
Why RAG works
- Reduces hallucinations by grounding answers in real, supplied text.
- Adds fresh & private knowledge without retraining.
- Cheaper than finetuning for "the model needs to know X."
- Updatable: change the knowledge base, not the model.
- Traceable: you can show which sources were used.
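The "question + retrieved chunks" step, and the traceability it enables, can be sketched as plain string assembly. The function name `build_rag_prompt`, the file names, and the instruction wording are hypothetical; the point is only that numbering the chunks gives the model something to cite.

```python
def build_rag_prompt(question, chunks):
    """Assemble a prompt from a question and retrieved chunks.

    Each chunk is a (source_id, text) pair. Numbering the chunks lets the
    model cite them as [1], [2], ... so the answer can be traced to sources.
    """
    context_lines = [
        f"[{i}] ({source_id}) {text}"
        for i, (source_id, text) in enumerate(chunks, start=1)
    ]
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number. "
        "If the sources do not contain the answer, say so.\n\n"
        "Sources:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks; in practice these come from the retriever.
chunks = [
    ("policy.md", "Refunds are accepted within 30 days."),
    ("faq.md", "Refunds go back to the original payment method."),
]
prompt = build_rag_prompt("How long do I have to request a refund?", chunks)
print(prompt)
```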
The retriever is the heart of RAG
RAG quality depends almost entirely on retrieval. There are two main approaches, plus a hybrid of the two:
- Term-based (sparse / keyword): e.g., BM25. Fast, great for exact terms, no training needed. Misses synonyms/meaning.
- Embedding-based (dense / semantic): embed text into vectors; retrieve by vector similarity in a vector database. Captures meaning, but needs a good embedding model and more infrastructure.
- Hybrid search: combine both, often with reranking, for the best results.
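The three approaches above can be sketched side by side on a toy corpus. Everything here is illustrative: the BM25 formula is one common variant, the 3-d "embeddings" are hand-written stand-ins for the output of a real embedding model, and reciprocal rank fusion (RRF) is just one way to combine rankings that this chapter does not itself prescribe.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Term-based scoring: one common variant of the BM25 formula."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact match, no credit (synonyms are missed)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def cosine(u, v):
    """Embedding-based scoring: cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rrf(rankings, k=60):
    """Hybrid: reciprocal rank fusion, summing 1 / (k + rank) per ranking."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


docs = {
    "d1": "how to reset your password",
    "d2": "password reset link expired",
    "d3": "recover access to a locked account",
}
# Hand-written toy vectors (assumption); a real system would embed the text.
doc_vecs = {"d1": [0.9, 0.1, 0.0], "d2": [0.6, 0.4, 0.0], "d3": [0.8, 0.0, 0.2]}
query_vec = [1.0, 0.0, 0.0]

sparse = bm25_scores("reset password", docs)
dense = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
sparse_rank = sorted(sparse, key=sparse.get, reverse=True)
dense_rank = sorted(dense, key=dense.get, reverse=True)
fused_rank = rrf([sparse_rank, dense_rank])
print(sparse_rank, dense_rank, fused_rank)
```

Note that `d3` shares no words with the query, so its term-based score is zero even though it is about the same problem; that is the gap the embedding-based ranking is meant to cover.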
Better retrieval > bigger model
Most "RAG isn't working" problems are retrieval problems. If the right chunk never reaches the prompt, no model can answer well. Invest there first.
Practical RAG knobs
- Chunking: how you split documents (size, overlap, by structure) strongly affects quality.
- Top-k: how many chunks to retrieve (more context vs. noise & cost).
- Reranking: re-score retrieved candidates to push the best ones to the top.
- Query rewriting: reshape the user's question for better retrieval.
- Metadata filtering: restrict by source, date, permissions.
- Beyond text: RAG can retrieve from tabular data / SQL, APIs, and multimodal sources, not just documents.
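As one concrete example of these knobs, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_words` and the choice to split on whitespace are assumptions; production chunkers often split by document structure (headings, paragraphs) instead.

```python
def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words, where each chunk repeats the
    last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=1)
print(chunks)
```

Overlap trades index size for safety: a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.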
Part 2: Agents
What is an agent?
An agent is a model that can perceive its environment and act on it using tools, often over multiple steps, to accomplish a goal. Where RAG adds information, agents take actions.
Goal → [Model plans] → [calls a Tool] → [observes result] → [plans again] → ... → Done
Tools: the agent's hands
Tools extend what the model can do:
- Knowledge tools: search, retrieval, database queries (RAG can be a tool).
- Action tools: send email, call an API, run code, edit a file.
- Capability tools: calculators, code interpreters (offload what models are bad at, like exact math).
The agent's tool inventory defines its power and its risk.
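A tool inventory can be sketched as a registry of plain functions plus a dispatcher. The tool names, the `{"tool": ..., "args": ...}` call shape, and the in-memory note store are all hypothetical; real model APIs define their own tool-call formats.

```python
def multiply(a, b):
    """Capability tool: exact arithmetic the model should not guess at."""
    return a * b


def search_notes(query):
    """Knowledge tool: keyword search over a tiny in-memory note store."""
    notes = ["Invoices are due in 30 days.", "Support hours are 9 to 5."]
    return [n for n in notes if query.lower() in n.lower()]


# The registry is the agent's entire inventory: nothing else is callable.
TOOLS = {"multiply": multiply, "search_notes": search_notes}


def run_tool_call(call):
    """Execute one tool call of the form {"tool": name, "args": {...}}."""
    name = call.get("tool")
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](**call.get("args", {}))}
    except TypeError as exc:  # wrong or missing arguments
        return {"ok": False, "error": str(exc)}


print(run_tool_call({"tool": "multiply", "args": {"a": 12, "b": 12}}))
print(run_tool_call({"tool": "delete_files", "args": {}}))
```

Returning errors as data rather than raising lets the model observe the failure and try something else on its next step.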
Planning and reasoning
The hard part of an agent is planning: breaking a goal into steps, choosing tools, and adjusting based on results. Patterns include reasoning-then-acting loops (e.g., ReAct-style), reflection/self-correction, and plan-then-execute. More autonomy = more capability but less predictability.
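The plan, act, observe loop can be sketched with a scripted stand-in for the model. `scripted_model` is not an LLM; it is a hard-coded function (an assumption for this sketch) that returns either a tool call or a final answer, which is enough to show the control flow and the step budget.

```python
def scripted_model(goal, history):
    """Stand-in for an LLM: picks the next step from what it has observed."""
    if not history:
        return {"thought": "I need the unit price first.",
                "tool": "get_price", "args": {"item": "widget"}}
    if len(history) == 1:
        price = history[0]["observation"]
        return {"thought": "Now compute the total for 3 units.",
                "tool": "multiply", "args": {"a": price, "b": 3}}
    return {"thought": "I have the total.",
            "final": f"Total: {history[-1]['observation']}"}


# Hypothetical tools with made-up data, for illustration only.
TOOLS = {
    "get_price": lambda item: {"widget": 4}[item],
    "multiply": lambda a, b: a * b,
}


def run_agent(goal, model, tools, max_steps=5):
    """Loop: ask the model for a step, run the tool, feed back the result."""
    history = []
    for _ in range(max_steps):
        step = model(goal, history)
        if "final" in step:
            return step["final"], history
        observation = tools[step["tool"]](**step["args"])
        history.append({**step, "observation": observation})
    return None, history  # step budget exhausted without a final answer


answer, history = run_agent("Price of 3 widgets", scripted_model, TOOLS)
print(answer)
for step in history:
    print(step["thought"], "->", step["tool"], "->", step["observation"])
```

The `max_steps` cap is the simplest guard on autonomy: it bounds cost and stops a confused agent from looping forever.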
Why agents are hard
- Errors compound. A small mistake early can derail a long chain of steps.
- Cost & latency balloon with many model calls and tool round-trips.
- Security risk is higher: an agent that can act can do real damage, and is a juicy target for prompt injection (Ch 5).
- Evaluation is harder: you must judge the whole trajectory, not just a final string (did it pick the right tools? take safe actions? finish efficiently?).
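A back-of-the-envelope sketch of how errors compound: if every step succeeds independently with the same probability, the chance that the whole chain succeeds is that probability raised to the number of steps. The 95% per-step figure is a hypothetical number, and real steps are rarely independent, so treat this as intuition rather than a measurement.

```python
def chain_success(p_step, n_steps):
    """Probability that all n steps succeed, assuming independent steps."""
    return p_step ** n_steps


# Hypothetical 95% per-step reliability across chains of growing length.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {chain_success(0.95, n):.3f}")
```

Under these assumptions a 20-step chain finishes cleanly only about a third of the time, which is why shorter plans and mid-trajectory checks pay off.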
Give agents the least power they need
The more an agent can do, the more it can break. Restrict tools, sandbox code execution, require approval for risky actions, and assume external content may try to hijack it.
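A least-privilege gate can be sketched as an allowlist plus an approval requirement, checked before any tool runs. The tool names and the `authorize` function are hypothetical, and sandboxing of code execution is out of scope for this sketch.

```python
ALLOWED_TOOLS = {"search_docs", "send_email"}  # everything else is denied
NEEDS_APPROVAL = {"send_email"}                # risky: has side effects


def authorize(tool_name, approved_by_human=False):
    """Return (allowed, reason) for a proposed tool call."""
    if tool_name not in ALLOWED_TOOLS:
        return False, "tool not on the allowlist"
    if tool_name in NEEDS_APPROVAL and not approved_by_human:
        return False, "human approval required"
    return True, "ok"


print(authorize("search_docs"))
print(authorize("send_email"))
print(authorize("send_email", approved_by_human=True))
print(authorize("delete_database"))
```

The check sits outside the model on purpose: a prompt injection can change what the agent asks for, but not what the gate allows.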
Memory
Agents (and chat apps) need memory beyond the context window:
- Short-term: the current conversation/session.
- Long-term: persisted facts/history, often stored externally and retrieved (RAG-like) when relevant.
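The two kinds of memory can be sketched in one small class. The class name `AgentMemory` and the word-overlap recall are assumptions for illustration; a real long-term store would persist to a database and would likely retrieve by embedding similarity.

```python
from collections import deque


class AgentMemory:
    def __init__(self, short_term_turns=4):
        # Short-term: only the most recent turns; older ones fall off.
        self.short_term = deque(maxlen=short_term_turns)
        # Long-term: persisted facts (an in-memory list stands in for a store).
        self.long_term = []

    def add_turn(self, role, text):
        self.short_term.append((role, text))

    def remember(self, fact):
        self.long_term.append(fact)

    def recall(self, query, k=2):
        """Return up to k stored facts sharing the most words with the query."""
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(fact.lower().split())), fact)
            for fact in self.long_term
        ]
        scored = [(score, fact) for score, fact in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored[:k]]


memory = AgentMemory(short_term_turns=4)
for i in range(5):
    memory.add_turn("user", f"message {i}")
memory.remember("user prefers metric units")
memory.remember("user lives in Lisbon")

print(list(memory.short_term))  # the oldest turn has been dropped
print(memory.recall("which units does the user prefer", k=1))
```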
RAG vs. Agents: how to choose
| | RAG | Agents |
|---|---|---|
| Adds… | Information | Actions + information |
| Maturity | Production-proven | Emerging |
| Complexity | Lower | Higher |
| Cost/latency | Moderate | Higher (many calls) |
| Best for | "Answer using my knowledge" | "Accomplish a multi-step task" |
Start with RAG. Reach for agents when the task genuinely needs multi-step action and tool use.
Takeaways
- The real lever here is context: get the right info/capabilities to the model.
- RAG grounds answers in retrieved data: reliable, cheap, updatable. Retrieval quality matters most; use hybrid search + reranking.
- Agents let models act via tools in a loop: powerful, but complex, costly, and riskier.
- Agents demand trajectory-level evaluation and strict security/least privilege.
- Default to RAG; escalate to agents only when the task truly requires action and planning.