Chapter 6 · RAG and Agents

In one minute

A model can only answer well if it has the right context. This chapter covers the two big patterns for giving models the information and capabilities they lack:

RAG (Retrieval-Augmented Generation): fetch relevant information and put it in the prompt. Mature, reliable, widely used in production.
Agents: let the model take actions using tools (search, code, APIs) in a loop. Far more powerful, but more complex and still maturing.

Both exist because of two hard limits: a model's knowledge is frozen at training time, and its context window is finite.

Why context matters

Models don't know your private data, recent events, or your user's specifics.
The context window (how many tokens the model can handle in one call) is limited, so you can't just paste everything.
Solution: put the right information in front of the model at the right time. That's what RAG and agents do.
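
A minimal sketch of the "can't paste everything" constraint, assuming chunks arrive sorted most relevant first and that a whitespace word count is an acceptable stand-in for a real tokenizer. The name `pack_context` and its parameters are illustrative, not taken from any library.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is roughly one token.
    # A real system would use the model's own tokenizer instead.
    return len(text.split())


def pack_context(
    chunks: list[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Greedily keep chunks, in the given order, while they fit the budget."""
    packed: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip chunks that would overflow; smaller ones may still fit
        packed.append(chunk)
        used += cost
    return packed


# Hypothetical chunks, already ordered from most to least relevant.
ranked_chunks = ["a b c", "d e f g h", "i j"]
selected = pack_context(ranked_chunks, budget_tokens=5)
print(selected)
```

The greedy skip is one possible design choice. Stopping at the first chunk that does not fit is simpler and preserves strict relevance order, at the cost of leaving part of the budget unused.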

Part 1: RAG

The core idea

Instead of relying on what the model memorized, retrieve relevant documents from an external knowledge source and add them to the prompt. The model then answers grounded in that context.

User question
     │
     ▼
[Retriever] ──fetches──> top-k relevant chunks from your knowledge base
     │
     ▼
Prompt = question + retrieved chunks
     │
     ▼
[Model] ──> grounded answer (ideally citing the chunks)
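
The flow above can be sketched in a few lines. This is a toy version under stated assumptions: the retriever ranks chunks by shared words rather than using a real index, the knowledge base is three hand-written sentences, and the model call itself is left out. All function names are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercased alphanumeric words; good enough for a toy retriever.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank chunks by how many question words they share."""
    question_words = tokenize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, retrieved: list[str]) -> str:
    """Prompt = question + retrieved chunks, numbered so the model can cite them."""
    sources = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(retrieved, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base.
KNOWLEDGE_BASE = [
    "The refund window is 30 days from purchase.",
    "Support is open Monday to Friday.",
    "Shipping to Canada takes 5 to 7 business days.",
]

question = "What is the refund window?"
top_chunks = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The resulting `prompt` string is what would be sent to the model. Numbering the sources is what makes the "ideally citing the chunks" step possible.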

Why RAG works

Reduces hallucinations by grounding answers in real, supplied text.
Adds fresh & private knowledge without retraining.
Usually cheaper than finetuning when the goal is "the model needs to know X."
Updatable: change the knowledge base, not the model.
Traceable: you can show which sources were used.

The retriever is the heart of RAG

Quality depends almost entirely on retrieval. Two main approaches, plus a hybrid that combines them:

Term-based (sparse / keyword): e.g., BM25. Fast, great for exact terms, no training needed. Misses synonyms/meaning.
Embedding-based (dense / semantic): embed text into vectors; retrieve by vector similarity in a vector database. Captures meaning, but needs a good embedding model and more infrastructure.
Hybrid search: combine both, often with reranking, for the best results.
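
A small sketch of the three approaches, with loud assumptions. The BM25 scorer is written from scratch using a common smoothed IDF form and typical defaults (`k1=1.5`, `b=0.75`). The dense ranking is hand-written, standing in for what an embedding model might return. Fusion uses reciprocal rank fusion (RRF), which is one common way to combine rankings and is not named in the text above.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # exact-term matching: synonyms contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse several rankings of document ids; a higher fused score ranks first."""
    fused: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


docs = [
    "How to reset your password",
    "Recovering a forgotten login credential",  # same meaning, no shared terms
    "Billing and invoices overview",
]
scores = bm25_scores("reset password", docs)

# Sparse ranking: only documents with a nonzero BM25 score.
order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
sparse_ranking = [i for i in order if scores[i] > 0]

# Assumption: a hand-written stand-in for an embedding-based ranking.
dense_ranking = [1, 0, 2]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(sparse_ranking, fused)
```

BM25 gives document 1 a score of zero because it shares no exact terms with the query, which is the "misses synonyms" weakness. The fused list still surfaces it in second place because the dense ranking vouches for it.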

Better retrieval > bigger model

Most "RAG isn't working" problems are retrieval problems. If the right chunk never reaches the prompt, no model can answer well. Invest there first.

Practical RAG knobs

Chunking: how you split documents (size, overlap, by structure) strongly affects quality.
Top-k: how many chunks to retrieve (more context vs. noise & cost).
Reranking: re-score retrieved candidates to push the best ones to the top.
Query rewriting: reshape the user's question for better retrieval.
Metadata filtering: restrict by source, date, permissions.
Beyond text: RAG can retrieve from tabular data / SQL, APIs, and multimodal sources, not just documents.
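
To make the chunking knob concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes splitting on whitespace words; real pipelines often split by tokens or by document structure such as headings and paragraphs. `chunk_words` is an illustrative name.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, size=4, overlap=1)
print(chunks)
```

Overlap trades index size for recall: a sentence that straddles a boundary appears whole in at least one chunk, but every overlapping word is stored and embedded twice.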

Part 2: Agents

What is an agent?

An agent is a model that can perceive its environment and act on it using tools, often over multiple steps, to accomplish a goal. Where RAG adds information, agents take actions.

Goal → [Model plans] → [calls a Tool] → [observes result] → [plans again] → ... → Done
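
The loop above, as a minimal sketch. The "model" here is a scripted function standing in for a real LLM call, and the only tool is a tiny calculator. The names `run_agent` and `scripted_model`, and the step dictionary format, are assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression: str) -> str:
    """Hypothetical tool: evaluate 'a op b', e.g. '19 * 23'."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


TOOLS = {"calculator": calculator}


def scripted_model(goal: str, history: list[tuple[str, str, str]]) -> dict:
    """Stand-in for an LLM: decide the next step from the goal and past observations."""
    if not history:
        return {"action": "calculator", "input": "19 * 23"}
    last_observation = history[-1][2]
    return {"action": "finish", "input": f"The answer is {last_observation}"}


def run_agent(goal, model, tools, max_steps: int = 5):
    history = []  # one (tool name, tool input, observation) tuple per step
    for _ in range(max_steps):
        step = model(goal, history)                 # plan
        if step["action"] == "finish":
            return step["input"], history           # done
        observation = tools[step["action"]](step["input"])  # call a tool
        history.append((step["action"], step["input"], observation))  # observe
    raise RuntimeError("agent hit max_steps without finishing")


answer, trace = run_agent("What is 19 * 23?", scripted_model, TOOLS)
print(answer)
```

The `max_steps` cap is a design choice worth keeping in any real loop: it bounds cost and stops an agent that never decides it is finished.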

Tools: the agent's hands

Tools extend what the model can do:

Knowledge tools: search, retrieval, database queries (RAG can be a tool).
Action tools: send email, call an API, run code, edit a file.
Capability tools: calculators, code interpreters (offload what models are bad at, like exact math).

The agent's tool inventory defines its power and its risk.
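
One way to represent a tool inventory, sketched under assumptions: each tool carries a name, a kind matching the three categories above, and a description the model would see in its prompt. The `Tool` class, the registry layout, and both example tools are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    kind: str          # "knowledge", "action", or "capability"
    description: str   # what the model is told about the tool
    run: Callable[[str], str]


def lookup_policy(topic: str) -> str:
    # Knowledge tool: a dictionary stands in for a real retrieval system.
    policies = {"refunds": "Refunds are accepted within 30 days."}
    return policies.get(topic.lower(), "No policy found.")


def word_count(text: str) -> str:
    # Capability tool: exact counting, which models tend to get wrong.
    return str(len(text.split()))


REGISTRY = {
    tool.name: tool
    for tool in [
        Tool("lookup_policy", "knowledge", "Look up a company policy by topic.", lookup_policy),
        Tool("word_count", "capability", "Count the words in a text exactly.", word_count),
    ]
}


def describe_tools(registry: dict[str, Tool]) -> str:
    """Render the inventory as text that could be placed in the prompt."""
    return "\n".join(f"- {t.name} ({t.kind}): {t.description}" for t in registry.values())


def call_tool(registry: dict[str, Tool], name: str, argument: str) -> str:
    if name not in registry:
        return f"error: unknown tool '{name}'"  # the agent sees this as an observation
    return registry[name].run(argument)


print(describe_tools(REGISTRY))
print(call_tool(REGISTRY, "word_count", "one two three"))
```

Because the registry is the complete list of what the agent can call, adding or removing an entry is how its power, and its risk, gets adjusted.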

Planning and reasoning

The hard part of an agent is planning: breaking a goal into steps, choosing tools, and adjusting based on results. Patterns include reasoning-then-acting loops (e.g., ReAct-style), reflection/self-correction, and plan-then-execute. More autonomy = more capability but less predictability.

Why agents are hard

Errors compound. A small mistake early can derail a long chain of steps.
Cost & latency balloon with many model calls and tool round-trips.
Security risk is higher: an agent that can act can do real damage, and is a juicy target for prompt injection (Ch 5).
Evaluation is harder: you must judge the whole trajectory, not just a final string (did it pick the right tools? take safe actions? finish efficiently?).
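
A sketch of what trajectory-level evaluation might check, assuming each step is logged as a dictionary with a `tool` key. The function name, the four checks, and the example trajectory are illustrative; a real evaluation would likely also inspect tool inputs and intermediate reasoning.

```python
def evaluate_trajectory(
    trajectory: list[dict],
    final_answer: str,
    expected_answer: str,
    required_tools: set[str],
    forbidden_tools: set[str],
    max_steps: int,
) -> dict[str, bool]:
    """Judge the whole run, not just the final string."""
    used = {step["tool"] for step in trajectory}
    return {
        "correct_answer": final_answer == expected_answer,
        "used_required_tools": required_tools <= used,       # right tools picked?
        "avoided_forbidden_tools": not (forbidden_tools & used),  # safe actions only?
        "within_step_budget": len(trajectory) <= max_steps,  # finished efficiently?
    }


# Hypothetical logged run.
trajectory = [
    {"tool": "search_docs", "input": "refund policy"},
    {"tool": "calculator", "input": "30 - 12"},
]
report = evaluate_trajectory(
    trajectory,
    final_answer="18 days left",
    expected_answer="18 days left",
    required_tools={"search_docs"},
    forbidden_tools={"send_email"},
    max_steps=3,
)
print(report)
```

A run can produce the right final answer and still fail this report, for example by calling a forbidden tool along the way.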

Give agents the least power they need

The more an agent can do, the more it can break. Restrict tools, sandbox code execution, require approval for risky actions, and assume external content may try to hijack it.
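
A minimal sketch of least privilege in code: an allowlist restricts which tools this agent may call at all, and risky tools additionally require an approval callback (a human in the loop, in practice). Every tool, set, and function name here is hypothetical, and the sketch does not cover sandboxing or prompt-injection defenses.

```python
from typing import Callable


def search_docs(query: str) -> str:
    return f"results for {query!r}"


def send_email(body: str) -> str:
    return "email sent"


def delete_file(path: str) -> str:
    return f"deleted {path}"


ALL_TOOLS = {"search_docs": search_docs, "send_email": send_email, "delete_file": delete_file}
ALLOWED = {"search_docs", "send_email"}   # this agent never gets delete_file
NEEDS_APPROVAL = {"send_email"}           # risky actions need a sign-off


def guarded_call(name: str, argument: str, approve: Callable[[str, str], bool]) -> str:
    """Run a tool only if it is allowlisted and, when risky, approved."""
    if name not in ALLOWED or name not in ALL_TOOLS:
        return f"denied: {name} is not on the allowlist"
    if name in NEEDS_APPROVAL and not approve(name, argument):
        return f"denied: {name} was not approved"
    return ALL_TOOLS[name](argument)


def deny_all(name: str, argument: str) -> bool:
    return False


def approve_all(name: str, argument: str) -> bool:
    return True


print(guarded_call("search_docs", "refund policy", deny_all))
print(guarded_call("send_email", "hi", deny_all))
print(guarded_call("delete_file", "/tmp/x", approve_all))
```

Returning the denial as a string lets the agent observe it and replan. Raising an exception instead is a reasonable alternative when a denied call should stop the run.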

Memory

Agents (and chat apps) need memory beyond the context window:

Short-term: the current conversation/session.
Long-term: persisted facts/history, often stored externally and retrieved (RAG-like) when relevant.
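
The two kinds of memory can be sketched as follows. Assumptions: short-term memory is a bounded queue of recent turns, long-term memory is an in-process list standing in for an external store, and recall uses simple word overlap where a real system would use a proper retriever. `AgentMemory` is an illustrative name.

```python
import re
from collections import deque


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class AgentMemory:
    def __init__(self, short_term_turns: int = 4):
        # Short-term: only the most recent turns; older ones fall off automatically.
        self.short_term: deque[tuple[str, str]] = deque(maxlen=short_term_turns)
        # Long-term: persisted facts (a list here; an external store in practice).
        self.long_term: list[str] = []

    def add_turn(self, role: str, text: str) -> None:
        self.short_term.append((role, text))

    def remember(self, fact: str) -> None:
        self.long_term.append(fact)

    def recall(self, query: str, k: int = 2) -> list[str]:
        """RAG-like lookup: return up to k stored facts that share words with the query."""
        query_words = words(query)
        scored = [(len(query_words & words(fact)), fact) for fact in self.long_term]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [fact for score, fact in ranked if score > 0][:k]


memory = AgentMemory(short_term_turns=2)
memory.add_turn("user", "Hi")
memory.add_turn("assistant", "Hello!")
memory.add_turn("user", "Convert 5 miles for me.")

memory.remember("The user prefers metric units.")
memory.remember("The user's dog is named Biscuit.")

recent = list(memory.short_term)
recalled = memory.recall("Which units does the user prefer?", k=1)
print(recent, recalled)
```

The first turn has already dropped out of `recent`, while the stored preference is still retrievable, which is the division of labor between the two memory types.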

RAG vs. Agents, how to choose

	RAG	Agents
Adds…	Information	Actions + information
Maturity	Production-proven	Emerging
Complexity	Lower	Higher
Cost/latency	Moderate	Higher (many calls)
Best for	"Answer using my knowledge"	"Accomplish a multi-step task"

Start with RAG. Reach for agents when the task genuinely needs multi-step action and tool use.

Takeaways

The real lever here is context: get the right info/capabilities to the model.
RAG grounds answers in retrieved data: reliable, cheap, updatable. Retrieval quality is everything; use hybrid search + reranking.
Agents let models act via tools in a loop: powerful, but complex, costly, and riskier.
Agents demand trajectory-level evaluation and strict security/least privilege.
Default to RAG; escalate to agents only when the task truly requires action and planning.
