07.04 — Ingestion Pipeline Introduction¶
Overview¶
Before diving into code, this lesson provides the architectural overview of the ingestion pipeline — the steps we'll implement over the next several lessons to transform a live documentation website into a searchable vector store.
The Pipeline at a Glance¶
flowchart LR
URL["🌐 LangChain Docs\n(python.langchain.com)"]
CRAWL["🕷️ Tavily Crawl\n(Map + Extract)"]
DOCS["📄 LangChain Documents"]
SPLIT["✂️ Text Splitter\n(Chunk into pieces)"]
EMBED["🔢 Embed\n(OpenAI)"]
STORE["🗄️ Vector Store\n(Pinecone)"]
URL --> CRAWL --> DOCS --> SPLIT --> EMBED --> STORE
style URL fill:#1e3a5f,color:#fff
style CRAWL fill:#4a9eff,color:#fff
style STORE fill:#10b981,color:#fff
Why Tavily?¶
In earlier versions of this course, documentation crawling was done manually — writing custom scripts to scrape HTML, handle pagination, deal with bot protection, manage dynamic rendering, etc. This was:
- Error-prone — different machines behaved differently
- Fragile — site structure changes broke the scripts
- Time-consuming — debugging crawling issues took hours
Tavily (previously Firecrawl was also considered) provides a managed crawling API that handles all of this complexity:
| Feature | Manual Approach | Tavily |
|---|---|---|
| Bot protection | 🔧 Handle yourself | ✅ Built-in |
| Dynamic rendering | 🔧 Headless browser setup | ✅ Automatic |
| Rate limiting | 🔧 Custom backoff logic | ✅ Managed |
| Concurrent extraction | 🔧 Threading code | ✅ API-level parallelism |
| Table/embed extraction | 🔧 Custom parsers | ✅ Advanced mode |
| Setup time | Hours | Minutes |
[!TIP] General principle: If crawling isn't your core business logic, offload it to a third-party that specializes in it. This frees you to focus on the RAG pipeline itself.
Tavily's Free Tier¶
Tavily offers a generous free tier that's more than sufficient for this project and course.
What's Coming Next¶
| Lesson | What We Build |
|---|---|
| 05 | Initialize all imports, embedding model, vector store, SSL config |
| 06 | Crawl the documentation using TavilyCrawl (one-call approach) |
| 07–08 | (Optional) Manual crawling with TavilyMap + TavilyExtract for more control |
| 09 | Recap transition to chunking |
| 10 | Chunk documents with RecursiveCharacterTextSplitter |
| 11 | Batch-index chunks into Pinecone concurrently |
The heavy lifting is shared between Tavily (crawling) and LangChain (chunking, embedding, indexing). Our job is to orchestrate them correctly.