
06.01 — Introduction to Retrieval-Augmented Generation

Overview

This lesson introduces the motivation behind RAG — why it exists, what problem it solves, and why the naive approach of stuffing entire documents into prompts doesn't work. Understanding this motivation is essential before diving into the implementation.


The Problem: LLMs Can't Access Your Data

LLMs are trained on large public datasets, but they have no knowledge of private data — your company's documents, internal knowledge bases, financial reports, or proprietary codebases. When you ask an LLM about information in a private document, it either hallucinates or admits it doesn't know.

```mermaid
flowchart LR
    User["👤 User: 'What potion does\nchapter 7 describe?'"]
    LLM["🤖 LLM"]

    User --> LLM
    LLM --> H["❌ Hallucinated answer\n(never read the book)"]

    style H fill:#ef4444,color:#fff
```

Examples of private data scenarios:

- A 300-page Harry Potter book — "How do you make this specific potion?"
- A financial contract — "What's the termination clause in section 4.2?"
- An internal knowledge base — "What's our company's refund policy?"
- A code repository — "How does the authentication module work?"

The LLM wasn't trained on any of this. We need a way to give it the relevant information at query time.


Solution 1: Stuff Everything (The Naive Approach)

The simplest idea: take the entire document and paste it into the prompt alongside the question.

```mermaid
flowchart LR
    Q["❓ Question"]
    DOC["📄 Entire Document\n(300 pages)"]
    PROMPT["📝 Prompt:\n'Here's the entire book.\nNow answer: ...'"]
    LLM["🤖 LLM"]

    Q --> PROMPT
    DOC --> PROMPT
    PROMPT --> LLM

    style PROMPT fill:#ef4444,color:#fff
```
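To make the failure concrete, here is a minimal Python sketch of the naive approach. The file path, the 128K-token window, and the ~1.3 tokens-per-word estimate are illustrative assumptions, not values prescribed by this lesson.

```python
# Naive "stuff everything" approach: paste the whole document into the prompt.
# book.txt, the 128K window, and the tokens-per-word heuristic are illustrative.
CONTEXT_WINDOW = 128_000  # assumed model limit, in tokens

with open("book.txt", encoding="utf-8") as f:
    document = f.read()

question = "How do you make this specific potion?"
prompt = f"Here's the entire book:\n\n{document}\n\nNow answer: {question}"

# Rough token estimate: ~1.3 tokens per word is a common rule of thumb.
estimated_tokens = int(len(prompt.split()) * 1.3)
print(f"Estimated prompt size: ~{estimated_tokens:,} tokens")

if estimated_tokens > CONTEXT_WINDOW:
    # This is the first failure mode listed below: the request is rejected
    # before the model ever sees the question.
    print("Prompt exceeds the context window -- the API call would be rejected.")
```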

Why This Fails

| Problem | Explanation | Impact |
| --- | --- | --- |
| Token limit | LLMs have a hard maximum context size (4K, 128K, even 1M tokens). A 300-page book easily exceeds this. | The API call fails — the request is rejected |
| Needle in a haystack | Research shows that even with huge context windows, LLMs become less effective at finding specific information in very long prompts. The answer gets "lost" in the noise. | Worse answer quality — the LLM misses or misinterprets the relevant passage |
| Cost | LLM pricing is per-token. Sending 100K tokens costs ~100x more than sending 1K tokens. | Unnecessary expense — you're paying for irrelevant context |
| Latency | More tokens mean longer processing time. A 100K-token prompt takes significantly longer to process than a 1K-token prompt. | Slow responses — unacceptable for user-facing applications |
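The cost row is simple arithmetic. A quick back-of-the-envelope check, using a purely illustrative per-token price (real prices vary by provider and model):

```python
# Cost comparison from the table above; the per-token price is illustrative only.
PRICE_PER_INPUT_TOKEN = 0.000002  # assumed $ per input token

naive_tokens = 100_000  # whole document stuffed into the prompt
rag_tokens = 1_000      # a few retrieved chunks

print(f"Naive: ${naive_tokens * PRICE_PER_INPUT_TOKEN:.4f} per question")
print(f"RAG:   ${rag_tokens * PRICE_PER_INPUT_TOKEN:.4f} per question")
print(f"~{naive_tokens // rag_tokens}x cheaper per call")
```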

> [!WARNING]
> Even with modern models that support 1M+ token context windows (like Gemini), the "stuff everything" approach is still problematic. The Needle-in-a-Haystack research shows that performance degrades with context length — the model may know the answer is "somewhere in there" but struggle to locate it precisely. More context ≠ better answers.


Solution 2: Retrieve Only What's Relevant (RAG)

Instead of sending the entire document, what if we could:

1. Pre-process the document into smaller chunks
2. Find only the chunks relevant to the user's question
3. Send only those relevant chunks to the LLM

```mermaid
flowchart TD
    subgraph Preprocessing["📥 Pre-processing (one-time)"]
        DOC["📄 Full Document"] --> CHUNKS["✂️ Split into chunks\n(paragraphs, sections)"]
        CHUNKS --> STORE["🗄️ Store chunks\n(searchable)"]
    end

    subgraph QueryTime["📤 Query Time (per question)"]
        Q["❓ User Question"] --> SEARCH["🔍 Find relevant chunks"]
        STORE -.-> SEARCH
        SEARCH --> RC["📋 Top 3 relevant chunks"]
        RC --> PROMPT["📝 Prompt:\n'Based on this context,\nanswer: ...'"]
        Q --> PROMPT
        PROMPT --> LLM["🤖 LLM"]
        LLM --> ANS["✅ Grounded answer"]
    end

    style Preprocessing fill:#4a9eff,color:#fff
    style QueryTime fill:#10b981,color:#fff
```
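Here is a minimal, dependency-free sketch of both phases. The word-overlap scoring is a deliberately naive stand-in for real retrieval (embeddings and vector search are covered in later lessons), and every name, path, and parameter here is illustrative rather than prescribed.

```python
# Minimal RAG sketch: the word-overlap score is a naive stand-in for real
# retrieval; file path, chunk size, and top_k are illustrative assumptions.

def split_into_chunks(document: str, chunk_size: int = 500) -> list[str]:
    """Pre-processing (one-time): split the document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def find_relevant_chunks(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Query time: score chunks by words shared with the question, keep the top_k."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, relevant_chunks: list[str]) -> str:
    """Augmentation: inject only the retrieved chunks into the prompt."""
    context = "\n\n".join(relevant_chunks)
    return f"Based on this context:\n\n{context}\n\nAnswer the question: {question}"

# Pre-processing (one-time)
document = open("book.txt", encoding="utf-8").read()  # illustrative path
chunks = split_into_chunks(document)

# Query time (per question)
question = "How do you make this specific potion?"
prompt = build_prompt(question, find_relevant_chunks(question, chunks))
# `prompt` now carries ~3 relevant chunks instead of the whole book and is
# ready to be sent to an LLM for generation.
```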

How RAG Solves All Four Problems

| Problem | How RAG Solves It |
| --- | --- |
| Token limit | Only 3–5 small chunks are sent, well within any token limit |
| Needle in a haystack | The LLM receives only relevant passages — no noise to search through |
| Cost | 1K tokens (3 chunks) instead of 100K tokens (whole doc) → ~100x cheaper |
| Latency | Processing 1K tokens is nearly instant; 100K takes seconds |

What "RAG" Actually Means

The name Retrieval-Augmented Generation describes the three steps:

```mermaid
flowchart LR
    R["🔍 Retrieval\nFind relevant chunks\nfrom the data store"]
    A["📝 Augmentation\nInject chunks into the\nprompt as context"]
    G["🤖 Generation\nLLM produces an answer\ngrounded in the context"]

    R --> A --> G

    style R fill:#4a9eff,color:#fff
    style A fill:#f59e0b,color:#fff
    style G fill:#10b981,color:#fff
```

| Step | What Happens |
| --- | --- |
| Retrieval | The user's query is used to search for the most relevant document chunks |
| Augmentation | The prompt is augmented (enriched) with the retrieved context |
| Generation | The LLM generates an answer grounded in the provided context |
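Continuing the earlier sketch, the generation step is just an ordinary LLM call with the augmented prompt. The OpenAI Python SDK is used here only as one concrete example; the lesson does not prescribe a provider, and the model name and system message are illustrative.

```python
# Generation: send the augmented prompt to an LLM. Provider, model name, and
# system message are illustrative choices; `prompt` comes from the earlier sketch.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model works here
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)  # answer grounded in the retrieved chunks
```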

The Open Questions

RAG introduces new challenges that the rest of this section addresses:

| Challenge | Question | Covered In |
| --- | --- | --- |
| Chunking strategy | How do we split documents? What chunk size? Overlap? | Lessons 04, 05 |
| Finding relevant chunks | How do we search for the "most relevant" chunks efficiently? | Lessons 02, 07 |
| Different document types | How do we handle PDFs vs. code vs. WhatsApp messages? | Lesson 04 |
| Relevance quality | What if the retrieved chunks aren't actually relevant? | Lesson 09 (Agentic RAG teaser) |
| Scalability | Does this work with millions of chunks? | Lessons 03, 05 |

Summary

| Concept | Key Takeaway |
| --- | --- |
| The problem | LLMs don't know about private/recent data and can't process very long documents effectively |
| Naive approach | Stuffing the whole document into the prompt fails due to token limits, cost, latency, and accuracy |
| RAG solution | Split → store → retrieve relevant chunks at query time → augment the prompt → generate |
| Why it works | Focused context → better answers, lower cost, faster responses, no token limit issues |
| Tradeoffs | Requires pre-processing, chunking strategy decisions, and a search mechanism |