06.01 - Introduction to Retrieval-Augmented Generation
Overview
This lesson introduces the motivation behind RAG — why it exists, what problem it solves, and why the naive approach of stuffing entire documents into prompts doesn't work. Understanding this motivation is essential before diving into the implementation.
The Problem: LLMs Can't Access Your Data
LLMs are trained on large public datasets, but they have no knowledge of private data — your company's documents, internal knowledge bases, financial reports, or proprietary codebases. When you ask an LLM about information in a private document, it either hallucinates or admits it doesn't know.
flowchart LR
User["👤 User: 'What potion does\nchapter 7 describe?'"]
LLM["🤖 LLM"]
User --> LLM
LLM --> H["❌ Hallucinated answer\n(never read the book)"]
style H fill:#ef4444,color:#fff
Examples of private data scenarios:

- A 300-page Harry Potter book: "How do you make this specific potion?"
- A financial contract: "What's the termination clause in section 4.2?"
- An internal knowledge base: "What's our company's refund policy?"
- A code repository: "How does the authentication module work?"

The LLM wasn't trained on any of this. We need a way to give it the relevant information at query time.
Solution 1: Stuff Everything (The Naive Approach)
The simplest idea: take the entire document and paste it into the prompt alongside the question.
flowchart LR
Q["❓ Question"]
DOC["📄 Entire Document\n(300 pages)"]
PROMPT["📝 Prompt:\n'Here's the entire book.\nNow answer: ...'"]
LLM["🤖 LLM"]
Q --> PROMPT
DOC --> PROMPT
PROMPT --> LLM
style PROMPT fill:#ef4444,color:#fff
Why This Fails
| Problem | Explanation | Impact |
|---|---|---|
| Token limit | Every LLM has a hard maximum context size (for example 4K, 128K, or 1M tokens, depending on the model). A 300-page book can run to well over 100K tokens, which exceeds many of these limits. | The API call fails: the request is rejected |
| Needle in a haystack | Needle-in-a-haystack evaluations suggest that even with huge context windows, LLMs become less reliable at finding specific information in very long prompts. The answer gets "lost" in the noise. | Worse answer quality: the LLM misses or misinterprets the relevant passage |
| Cost | LLM pricing is per-token. Sending 100K tokens costs ~100x more than sending 1K tokens. | Unnecessary expense — you're paying for irrelevant context |
| Latency | More tokens = longer processing time. A 100K-token prompt takes significantly longer to process than a 1K-token prompt. | Slow responses — unacceptable for user-facing applications |
[!WARNING] Even with modern models that support 1M+ token context windows (like Gemini), the "stuff everything" approach is still problematic. Needle-in-a-haystack evaluations suggest that accuracy can degrade as the context grows: the answer is somewhere in the prompt, but the model may struggle to locate it precisely. Cost and latency also keep scaling with prompt length. More context ≠ better answers.
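
To make the token-limit row concrete, here is a rough sketch of a "will it fit?" check. It assumes about 4 characters per token (a common rule of thumb for English text, not a real tokenizer) and about 2,000 characters per page; both figures, and the helper names `estimate_tokens` and `fits_in_context`, are illustrative assumptions.

```python
# Rough heuristic (an assumption, not a tokenizer): ~4 characters per token
# for English text. Real counts depend on the model's tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for a piece of text."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, context_limit: int, reserved_for_answer: int = 1_000) -> bool:
    """Check whether the text, plus room for the answer, fits in the window."""
    return estimate_tokens(text) + reserved_for_answer <= context_limit


# Hypothetical 300-page book at ~2,000 characters per page (assumed figure).
book = "x" * (300 * 2_000)

print(estimate_tokens(book))              # 150000
print(fits_in_context(book, 4_000))       # False
print(fits_in_context(book, 128_000))     # False
print(fits_in_context(book, 1_000_000))   # True
```

Under these assumptions the book only fits in the largest window, and fitting does not address the accuracy, cost, or latency rows of the table.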
Solution 2: Retrieve Only What's Relevant (RAG)
Instead of sending the entire document, what if we could:

1. Pre-process the document into smaller chunks
2. Find only the chunks relevant to the user's question
3. Send only those relevant chunks to the LLM

flowchart TD
subgraph Preprocessing["📥 Pre-processing (one-time)"]
DOC["📄 Full Document"] --> CHUNKS["✂️ Split into chunks\n(paragraphs, sections)"]
CHUNKS --> STORE["🗄️ Store chunks\n(searchable)"]
end
subgraph QueryTime["📤 Query Time (per question)"]
Q["❓ User Question"] --> SEARCH["🔍 Find relevant chunks"]
STORE -.-> SEARCH
SEARCH --> RC["📋 Top 3 relevant chunks"]
RC --> PROMPT["📝 Prompt:\n'Based on this context,\nanswer: ...'"]
Q --> PROMPT
PROMPT --> LLM["🤖 LLM"]
LLM --> ANS["✅ Grounded answer"]
end
style Preprocessing fill:#4a9eff,color:#fff
style QueryTime fill:#10b981,color:#fff
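
The two phases in the diagram can be sketched in a few lines of Python. This is a deliberately naive illustration: chunks are paragraphs, the "store" is a plain list, and relevance is scored by counting shared words. The function names and the sample document are made up for the example, and word overlap is only a stand-in for the search methods covered in later lessons.

```python
import re


def split_into_chunks(document: str) -> list[str]:
    """Pre-processing: split on blank lines so each paragraph is one chunk."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def find_relevant_chunks(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Query time: rank chunks by how many question words they share."""
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(chunk)), chunk) for chunk in chunks]
    scored = [(score, chunk) for score, chunk in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps document order on ties
    return [chunk for _, chunk in scored[:top_k]]


document = """The refund policy allows returns within 30 days of purchase.

Shipping takes 3 to 5 business days.

To request a refund, email support with your order number."""

store = split_into_chunks(document)  # one-time pre-processing
top = find_relevant_chunks("What is the refund policy?", store, top_k=2)

for chunk in top:
    print("-", chunk)
```

Here the shipping paragraph shares no words with the question and is dropped, while the two refund paragraphs are returned. Word overlap counts filler words like "the" and misses synonyms, which is one reason real systems use a stronger search mechanism.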
How RAG Solves All Four Problems
| Problem | How RAG Solves It |
|---|---|
| Token limit | Only 3–5 small chunks are sent, well within typical context windows |
| Needle in haystack | The LLM receives only the passages most likely to be relevant, so there is far less noise to search through |
| Cost | 1K tokens (3 chunks) instead of 100K tokens (whole doc) → ~100x cheaper |
| Latency | Processing 1K tokens is nearly instant; 100K takes seconds |
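
The cost row is simple arithmetic, sketched below. The price of $1.00 per million input tokens and the workload of 10,000 questions per month are hypothetical numbers chosen for easy math. Real prices vary by provider and model, and output tokens are billed separately.

```python
# Hypothetical price for illustration only: $1.00 per 1M input tokens.
PRICE_PER_MILLION_INPUT_TOKENS = 1.00


def prompt_cost(num_tokens: int) -> float:
    """Input-side cost in dollars for a prompt of num_tokens tokens."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


stuffed = prompt_cost(100_000)  # whole document in every prompt
focused = prompt_cost(1_000)    # roughly three retrieved chunks

ratio = stuffed / focused
questions_per_month = 10_000    # assumed workload

print(f"per question: ${stuffed:.4f} vs ${focused:.4f} ({ratio:.0f}x)")
print(f"per month:    ${stuffed * questions_per_month:,.2f} vs ${focused * questions_per_month:,.2f}")
```

Because input cost scales linearly with token count, the roughly 100x ratio holds whatever the actual per-token price is.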
What "RAG" Actually Means
The name Retrieval-Augmented Generation describes the three steps:
flowchart LR
R["🔍 Retrieval\nFind relevant chunks\nfrom the data store"]
A["📝 Augmentation\nInject chunks into the\nprompt as context"]
G["🤖 Generation\nLLM produces an answer\ngrounded in the context"]
R --> A --> G
style R fill:#4a9eff,color:#fff
style A fill:#f59e0b,color:#fff
style G fill:#10b981,color:#fff
| Step | What Happens |
|---|---|
| Retrieval | The user's query is used to search for the most relevant document chunks |
| Augmentation | The prompt is augmented (enriched) with the retrieved context |
| Generation | The LLM generates an answer grounded in the provided context |
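
The three steps map naturally onto three function calls. The sketch below focuses on the augmentation step and leaves retrieval and generation as pluggable callables. The prompt wording, the `augment` and `answer` helpers, and the two stub functions are illustrative assumptions; in a real system `generate` would call an LLM API and `retrieve` would query the chunk store.

```python
from typing import Callable

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""


def augment(question: str, chunks: list[str]) -> str:
    """Augmentation: inject the retrieved chunks into the prompt as context."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Run retrieval, augmentation, and generation in order."""
    chunks = retrieve(question)          # Retrieval
    prompt = augment(question, chunks)   # Augmentation
    return generate(prompt)              # Generation


# Stubs standing in for a real search step and a real LLM call.
def fake_retrieve(question: str) -> list[str]:
    return ["Refunds are accepted within 30 days of purchase."]


def fake_generate(prompt: str) -> str:
    return "Refunds are accepted within 30 days."


question = "What is the refund policy?"
prompt = augment(question, fake_retrieve(question))
print(prompt)
print(answer(question, fake_retrieve, fake_generate))
```

Passing `retrieve` and `generate` in as arguments keeps the augmentation logic testable without a model or a data store.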
The Open Questions
RAG introduces new challenges that the rest of this section addresses:
| Challenge | Question | Covered In |
|---|---|---|
| Chunking strategy | How do we split documents? What chunk size? Overlap? | Lessons 04, 05 |
| Finding relevant chunks | How do we search for the "most relevant" chunks efficiently? | Lessons 02, 07 |
| Different document types | How do we handle PDFs vs. code vs. WhatsApp messages? | Lesson 04 |
| Relevance quality | What if the retrieved chunks aren't actually relevant? | Lesson 09 (Agentic RAG teaser) |
| Scalability | Does this work with millions of chunks? | Lessons 03, 05 |
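
As a small preview of the chunk size and overlap question, here is one possible sketch of fixed-size chunking where consecutive chunks share some characters, so a sentence cut at a boundary still appears whole in at least one chunk. The function name, the character-based sizing, and the tiny sizes in the example are assumptions for illustration; later lessons cover the strategies actually used.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=0))
# ['abcd', 'efgh', 'ij']
```

Larger overlap reduces the chance of splitting an answer across chunks but stores more duplicated text, which is part of the tradeoff the chunking lessons discuss.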
Summary
| Concept | Key Takeaway |
|---|---|
| The problem | LLMs don't know about private/recent data and can't process very long documents effectively |
| Naive approach | Stuffing the whole document into the prompt runs into token limits, high cost, high latency, and reduced accuracy |
| RAG solution | Split → store → retrieve relevant chunks at query time → augment the prompt → generate |
| Why it works | Focused context → better answers, lower cost, faster responses, no token limit issues |
| Tradeoffs | Requires pre-processing, chunking strategy decisions, and a search mechanism |