Skip to content

07.06 — Tavily Crawl: Automated Documentation Crawling

Overview

This lesson demonstrates the simplest approach to crawling a documentation website: TavilyCrawl — a single API call that automatically maps, filters, and extracts content from an entire site. We explore the key parameters (max_depth, extract_depth, instructions) and see how to convert crawled results into LangChain Document objects for downstream processing.


What Is Web Crawling?

Web crawling is the automated process of browsing a website by following hyperlinks — discovering pages, extracting content, and moving deeper into the site structure. For RAG applications, crawling is how we acquire the source data.

flowchart TD
    ROOT["🌐 python.langchain.com"]
    P1["📄 /docs/get-started"]
    P2["📄 /docs/modules/models"]
    P3["📄 /docs/modules/agents"]
    P4["📄 /docs/tutorials/rag"]
    P5["📄 /docs/integrations"]

    ROOT --> P1 & P2 & P3
    P2 --> P4
    P3 --> P5

    style ROOT fill:#4a9eff,color:#fff

Using TavilyCrawl

async def main():
    log_header("Documentation Ingestion Pipeline")
    log_info("Using TavilyCrawl to start crawling python.langchain.com")

    result = crawl.invoke({
        "url": "https://python.langchain.com",
        "max_depth": 5,
        "extract_depth": "advanced",
    })

The max_depth Parameter

max_depth controls how far from the root URL the crawler explores:

max_depth Pages Found Time Use Case
1 ~18 <1 second Quick test — top-level pages only
2 ~75 Few seconds Moderate coverage
5 (max) ~251 ~26 seconds Full documentation crawl
flowchart TD
    ROOT["🌐 Root URL\n(depth 0)"]
    D1A["📄 Page A\n(depth 1)"]
    D1B["📄 Page B\n(depth 1)"]
    D2A["📄 Page C\n(depth 2)"]
    D2B["📄 Page D\n(depth 2)"]
    D3A["📄 Page E\n(depth 3)"]

    ROOT --> D1A & D1B
    D1A --> D2A & D2B
    D2A --> D3A

    style ROOT fill:#4a9eff,color:#fff
    style D3A fill:#f59e0b,color:#fff

[!TIP] Best practice: Start with max_depth=1 or 2 for fast iteration. Only increase after reviewing the results and confirming you need deeper pages. Higher depth can be exponentially slower depending on the site's topology.

The extract_depth Parameter

Value Behavior
"basic" Standard text extraction — fast
"advanced" Extracts tables, embedded content, code blocks — more thorough but higher latency

For documentation sites with code examples and tables, always use "advanced".


The instructions Parameter: Intelligent Filtering

The most powerful parameter — provide natural language instructions that guide the crawler on which pages to scrape:

result = crawl.invoke({
    "url": "https://python.langchain.com",
    "max_depth": 5,
    "extract_depth": "advanced",
    "instructions": "Search for content on AI agents",
})

Without instructions: 251 pages (everything)
With instructions: 23 pages (only agent-related documentation)

How Instructions Work

flowchart TD
    DISCOVER["🔍 Discover URL"]
    CHECK["🤔 Does this URL match\nthe instructions?"]
    CRAWL["✅ Crawl this page"]
    SKIP["❌ Skip this page"]

    DISCOVER --> CHECK
    CHECK -->|"Yes"| CRAWL
    CHECK -->|"No"| SKIP

Instructions act as a URL-level filter during the mapping phase. The crawler decides for each discovered URL whether to extract its content based on the instruction.

[!IMPORTANT] Write instructions as filtering criteria, not questions. The instructions help Tavily decide which pages to crawl, not what to extract from those pages.

✅ Good: "Content about AI agents and autonomous systems"
❌ Bad: "What are AI agents?"

Instructions + max_depth Synergy

With instructions active, you can safely use higher max_depth because the crawler skips irrelevant pages. The filtering offsets the depth increase.


Converting Results to LangChain Documents

The crawl result is a dictionary with a results key containing a list of scraped pages:

# Result structure:
# {
#     "base_url": "https://python.langchain.com",
#     "results": [
#         {"url": "https://...", "raw_content": "Page text..."},
#         {"url": "https://...", "raw_content": "Page text..."},
#         ...
#     ]
# }

# Convert to LangChain Documents
all_docs = [
    Document(
        page_content=result["raw_content"],
        metadata={"source": result["url"]}
    )
    for result in result["results"]
]

Why the metadata.source Field Matters

Purpose How Source URL Is Used
Citations Show users where the answer came from
Trust Users can click the link and verify the answer
Debugging Trace which chunk produced a wrong answer
Filtering Search only within specific sections of the docs

Summary

Feature Value Effect
max_depth 1–5 Controls how deep the crawler explores; higher = more pages but slower
extract_depth "advanced" Extracts tables, code blocks, and embedded content
instructions Natural language Filters pages by topic — reduces noise, enables higher depth
Output List[Document] Each page → one Document with page_content + metadata.source
Key Insight Detail
Start shallow Use max_depth=1 for fast iteration; increase only when needed
Instructions as filters They decide which URLs to crawl, not what to extract
Always save source The metadata.source URL enables citations and debugging
One-call simplicity TavilyCrawl combines mapping + filtering + extraction in one call