07.10 — Chunking with RecursiveCharacterTextSplitter¶
Overview¶
With 500+ documents loaded, we now chunk them into smaller pieces for embedding and retrieval. This lesson uses RecursiveCharacterTextSplitter, LangChain's recommended splitter for generic text, and discusses a broader question: is RAG dead in the age of million-token context windows?
The Chunking Code¶
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=200,
)
split_docs = text_splitter.split_documents(all_docs)
log_success(f"Split into {len(split_docs)} chunks")
# → Split into ~6,500 chunks
That's it: a few lines of code, with LangChain doing the heavy lifting. But the parameters matter enormously.
RecursiveCharacterTextSplitter — How It Works¶
Unlike CharacterTextSplitter (which splits on a single separator), the recursive variant tries a list of separators in priority order. With the defaults, that list is `["\n\n", "\n", " ", ""]`:
flowchart TD
TEXT["📄 Full Document"]
S1["Try split by: \\n\\n\n(double newline: paragraphs)"]
S2["Try split by: \\n\n(single newline: lines)"]
S3["Try split by: ' '\n(spaces: words)"]
S4["Try split by: ''\n(characters: last resort)"]
TEXT --> S1
S1 -->|"Chunks still > chunk_size"| S2
S2 -->|"Chunks still > chunk_size"| S3
S3 -->|"Chunks still > chunk_size"| S4
style S1 fill:#10b981,color:#fff
style S4 fill:#ef4444,color:#fff
The recursive part: it splits on the most semantic separator first (paragraph breaks) and, for any piece that is still larger than chunk_size, recurses with the next separator in the list. This preserves document structure far better than naive fixed-width splitting.
Parameter Choices¶
chunk_size = 4000¶
| Value | Tradeoff |
|---|---|
| Too small (200) | Chunks lack context — the LLM can't understand them alone |
| Too large (20,000) | May include irrelevant content; wastes tokens in the prompt |
| 4000 | Good balance for documentation: typically captures a complete class or function doc. Note that chunk_size counts characters, not tokens, by default |
chunk_overlap = 200¶
Overlap ensures that information split across chunk boundaries isn't lost: the last ~200 characters of each chunk are repeated at the start of the next, so a sentence that straddles a boundary still appears in full in at least one chunk (provided it fits within the overlap window).
Is RAG Dead? (The Million-Token Question)¶
With models like Gemini 2.5 supporting 2 million input tokens and Anthropic supporting 1 million, a common question arises: why not just stuff the entire documentation into the prompt?
Why RAG Is Not Dead — It's Evolving¶
| Argument | Explanation |
|---|---|
| Cost efficiency | Embedding the full LangChain docs (~6,500 chunks × 4,000 chars ≈ 26M chars) as input tokens on every query would be incredibly expensive. RAG retrieves only ~4 relevant chunks. |
| Precision & noise reduction | RAG filters to only the most relevant chunks, reducing hallucinations and positional biases that occur with very long contexts |
| Relevance ranking | Retrieval with relevance ranking surfaces the best evidence first, improving answer quality while using fewer tokens |
| Source citations | RAG gives you source traceability — know exactly which document the answer came from. Critical for regulated environments |
| Speed | Prefilling a ~4K-token prompt is near-instant; prefilling millions of tokens of documentation adds significant latency on every query |
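The cost row is easy to sanity-check with back-of-envelope arithmetic. The ~4 characters/token ratio and the $3 per million input tokens price below are illustrative assumptions, not quoted rates:

```python
# Rough cost comparison: full-docs-in-prompt vs. RAG retrieving ~4 chunks.
# The chars/token ratio and price are illustrative assumptions.
chars_total = 6_500 * 4_000            # ~26M characters of documentation
tokens_full = chars_total // 4         # rule of thumb: ~4 characters/token
tokens_rag = 4 * 4_000 // 4            # ~4 retrieved chunks of 4,000 chars

price = 3.00                           # hypothetical $ per 1M input tokens
cost_full = tokens_full / 1e6 * price
cost_rag = tokens_rag / 1e6 * price

print(f"full context: {tokens_full:,} tokens, ${cost_full:.2f} per query")
print(f"RAG:          {tokens_rag:,} tokens, ${cost_rag:.4f} per query")
# ~6.5M tokens also exceeds even a 2M-token context window outright.
```

Under these assumptions the full-context approach costs three orders of magnitude more per query, and the corpus would not even fit in the largest advertised windows.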
The Correct Perspective¶
[!IMPORTANT] Larger context windows don't kill RAG; they amplify it. Long-context models are complementary to RAG: they allow richer prompts and more retrieved context per query, while RAG provides the precision, cost efficiency, and citability that production applications require.
flowchart LR
RAG["🔍 RAG\n(Precision, Cost,\nCitations)"]
LONG["📄 Long Context\n(Richer prompts,\nmore flexibility)"]
BEST["✅ Best of Both\n(RAG + long context\nfor production)"]
RAG --> BEST
LONG --> BEST
style BEST fill:#10b981,color:#fff
Summary¶
| Concept | Key Insight |
|---|---|
| RecursiveCharacterTextSplitter | Tries semantic separators first, falls back to less meaningful ones |
| chunk_size=4000 | Captures complete documentation sections without being too broad |
| chunk_overlap=200 | Preserves context across chunk boundaries |
| Not a silver bullet | Many advanced strategies exist (semantic chunking, small-to-big, etc.) |
| RAG is not dead | It's evolving — provides precision, cost savings, and citations that long context can't |