07.06 — Tavily Crawl: Automated Documentation Crawling¶
Overview¶
This lesson demonstrates the simplest approach to crawling a documentation website: TavilyCrawl — a single API call that automatically maps, filters, and extracts content from an entire site. We explore the key parameters (max_depth, extract_depth, instructions) and see how to convert crawled results into LangChain Document objects for downstream processing.
What Is Web Crawling?¶
Web crawling is the automated process of browsing a website by following hyperlinks — discovering pages, extracting content, and moving deeper into the site structure. For RAG applications, crawling is how we acquire the source data.
```mermaid
flowchart TD
    ROOT["🌐 python.langchain.com"]
    P1["📄 /docs/get-started"]
    P2["📄 /docs/modules/models"]
    P3["📄 /docs/modules/agents"]
    P4["📄 /docs/tutorials/rag"]
    P5["📄 /docs/integrations"]
    ROOT --> P1 & P2 & P3
    P2 --> P4
    P3 --> P5
    style ROOT fill:#4a9eff,color:#fff
```
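Conceptually, a crawler is a breadth-first traversal over hyperlinks. The following is a minimal sketch using only the standard library; the `extract_links` helper and the in-memory `pages` dict are illustrative stand-ins for real HTTP fetching, which TavilyCrawl handles for you:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute href targets from <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def crawl_order(start_url, pages, max_depth):
    """Breadth-first crawl skeleton.

    `pages` maps url -> html and stands in for real HTTP fetches;
    plug in urllib or httpx for a live crawler.
    """
    seen, queue, visited = {start_url}, deque([(start_url, 0)]), []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # don't follow links past the depth limit
        for link in extract_links(pages.get(url, ""), url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Even this toy version shows the two knobs every crawler has: how links are discovered (extraction) and how far to follow them (depth).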
Using TavilyCrawl¶
```python
from langchain_tavily import TavilyCrawl

crawl = TavilyCrawl()  # reads TAVILY_API_KEY from the environment

async def main():
    # log_header / log_info are logging helpers defined elsewhere in the lesson
    log_header("Documentation Ingestion Pipeline")
    log_info("Using TavilyCrawl to start crawling python.langchain.com")
    result = crawl.invoke({
        "url": "https://python.langchain.com",
        "max_depth": 5,
        "extract_depth": "advanced",
    })
```
The max_depth Parameter¶
max_depth controls how far from the root URL the crawler explores:
| max_depth | Pages Found | Time | Use Case |
|---|---|---|---|
| `1` | ~18 | <1 second | Quick test — top-level pages only |
| `2` | ~75 | Few seconds | Moderate coverage |
| `5` (max) | ~251 | ~26 seconds | Full documentation crawl |
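The growth in the table is roughly geometric. A back-of-envelope sketch (assuming a uniform branching factor, which real sites don't have, so treat it as an upper bound rather than a prediction):

```python
def reachable_pages(branching_factor, max_depth):
    """Upper bound on reachable pages: 1 + b + b^2 + ... + b^max_depth."""
    return sum(branching_factor ** d for d in range(max_depth + 1))

# With just 3 links per page, depth 1 reaches 4 pages
# while depth 5 reaches 364 — each extra level multiplies the work.
```

This is why the tip below recommends starting shallow: each increment of `max_depth` multiplies the candidate page count.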
```mermaid
flowchart TD
    ROOT["🌐 Root URL\n(depth 0)"]
    D1A["📄 Page A\n(depth 1)"]
    D1B["📄 Page B\n(depth 1)"]
    D2A["📄 Page C\n(depth 2)"]
    D2B["📄 Page D\n(depth 2)"]
    D3A["📄 Page E\n(depth 3)"]
    ROOT --> D1A & D1B
    D1A --> D2A & D2B
    D2A --> D3A
    style ROOT fill:#4a9eff,color:#fff
    style D3A fill:#f59e0b,color:#fff
```
[!TIP] Best practice: Start with `max_depth=1` or `2` for fast iteration. Only increase it after reviewing the results and confirming you need deeper pages. Higher depth can be exponentially slower depending on the site's topology.
The extract_depth Parameter¶
| Value | Behavior |
|---|---|
"basic" |
Standard text extraction — fast |
"advanced" |
Extracts tables, embedded content, code blocks — more thorough but higher latency |
For documentation sites with code examples and tables, always use "advanced".
The instructions Parameter: Intelligent Filtering¶
The most powerful parameter — provide natural language instructions that guide the crawler on which pages to scrape:
```python
result = crawl.invoke({
    "url": "https://python.langchain.com",
    "max_depth": 5,
    "extract_depth": "advanced",
    "instructions": "Search for content on AI agents",
})
```
Without instructions: 251 pages (everything)
With instructions: 23 pages (only agent-related documentation)
How Instructions Work¶
```mermaid
flowchart TD
    DISCOVER["🔍 Discover URL"]
    CHECK["🤔 Does this URL match\nthe instructions?"]
    CRAWL["✅ Crawl this page"]
    SKIP["❌ Skip this page"]
    DISCOVER --> CHECK
    CHECK -->|"Yes"| CRAWL
    CHECK -->|"No"| SKIP
```
Instructions act as a URL-level filter during the mapping phase. The crawler decides for each discovered URL whether to extract its content based on the instruction.
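Tavily's actual matching happens server-side and is semantic, but the URL-level filtering idea can be illustrated with a crude keyword predicate (`matches_instructions` is a hypothetical stand-in, not Tavily's implementation):

```python
def matches_instructions(url, keywords):
    """Crude stand-in for Tavily's instruction matching:
    keep a URL only if its path mentions a topic keyword."""
    return any(kw in url.lower() for kw in keywords)

discovered = [
    "https://python.langchain.com/docs/modules/agents",
    "https://python.langchain.com/docs/modules/models",
    "https://python.langchain.com/docs/tutorials/rag",
]

# Only agent-related URLs survive the mapping phase
to_crawl = [u for u in discovered if matches_instructions(u, ["agent"])]
```

The key point the sketch captures: filtering happens on discovered URLs *before* extraction, which is why it saves crawl time rather than just post-processing time.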
[!IMPORTANT] Write instructions as filtering criteria, not questions. The instructions help Tavily decide which pages to crawl, not what to extract from those pages.
✅ Good: "Content about AI agents and autonomous systems"
❌ Bad: "What are AI agents?"
Instructions + max_depth Synergy¶
With instructions active, you can safely use higher max_depth because the crawler skips irrelevant pages. The filtering offsets the depth increase.
Converting Results to LangChain Documents¶
The crawl result is a dictionary with a results key containing a list of scraped pages:
```python
from langchain_core.documents import Document

# Result structure:
# {
#   "base_url": "https://python.langchain.com",
#   "results": [
#     {"url": "https://...", "raw_content": "Page text..."},
#     {"url": "https://...", "raw_content": "Page text..."},
#     ...
#   ]
# }

# Convert to LangChain Documents (the loop variable is named `page`
# so it does not shadow the crawl result dict)
all_docs = [
    Document(
        page_content=page["raw_content"],
        metadata={"source": page["url"]},
    )
    for page in result["results"]
]
```
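One practical wrinkle worth guarding against (my assumption, not something the lesson states): crawled pages can come back with empty or missing `raw_content`. A defensive, self-contained version of the conversion skips them; the `Document` dataclass below is a minimal stand-in for `langchain_core.documents.Document`:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def to_documents(crawl_result):
    """Convert a TavilyCrawl-style result dict to Documents,
    skipping pages whose raw_content is empty or missing."""
    return [
        Document(page_content=page["raw_content"], metadata={"source": page["url"]})
        for page in crawl_result.get("results", [])
        if page.get("raw_content")
    ]
```

Filtering empty pages here keeps blank chunks out of the vector store downstream.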
Why the metadata.source Field Matters¶
| Purpose | How Source URL Is Used |
|---|---|
| Citations | Show users where the answer came from |
| Trust | Users can click the link and verify the answer |
| Debugging | Trace which chunk produced a wrong answer |
| Filtering | Search only within specific sections of the docs |
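The last row — searching only within specific sections — can be as simple as a prefix check on `metadata.source`. A sketch using plain dicts in place of Document objects (the helper name is illustrative):

```python
docs = [
    {"page_content": "...", "metadata": {"source": "https://python.langchain.com/docs/tutorials/rag"}},
    {"page_content": "...", "metadata": {"source": "https://python.langchain.com/docs/integrations/x"}},
]

def docs_in_section(docs, prefix):
    """Keep only documents whose source URL falls under one docs section."""
    return [d for d in docs if d["metadata"]["source"].startswith(prefix)]

tutorials = docs_in_section(docs, "https://python.langchain.com/docs/tutorials")
```

None of this is possible if the source URL was dropped at ingestion time, which is the point of the table above.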
Summary¶
| Feature | Value | Effect |
|---|---|---|
| `max_depth` | 1–5 | Controls how deep the crawler explores; higher = more pages but slower |
| `extract_depth` | `"advanced"` | Extracts tables, code blocks, and embedded content |
| `instructions` | Natural language | Filters pages by topic — reduces noise, enables higher depth |
| Output | `List[Document]` | Each page → one Document with `page_content` + `metadata.source` |

| Key Insight | Detail |
|---|---|
| Start shallow | Use max_depth=1 for fast iteration; increase only when needed |
| Instructions as filters | They decide which URLs to crawl, not what to extract |
| Always save source | The metadata.source URL enables citations and debugging |
| One-call simplicity | TavilyCrawl combines mapping + filtering + extraction in one call |