
07.07 — TavilyMap & TavilyExtract: Manual Two-Step Crawling

Overview

While TavilyCrawl (Lesson 06) handles everything in one call, sometimes you need more granular control over the crawling process. TavilyMap and TavilyExtract split crawling into two explicit steps: first discover all URLs (map), then extract content from selected URLs (extract). This lesson covers when and why you'd use this approach.


TavilyCrawl vs TavilyMap + TavilyExtract

flowchart TD
    subgraph OneCall["TavilyCrawl (Lesson 06)"]
        C1["🌐 URL"] --> C2["🔍 Auto map\n+ filter + extract"] --> C3["📄 Documents"]
    end

    subgraph TwoStep["TavilyMap + TavilyExtract (This Lesson)"]
        M1["🌐 URL"] --> M2["🗺️ TavilyMap\n(Discover URLs)"]
        M2 --> M3["🔧 Custom filtering\n+ batching logic"]
        M3 --> M4["📥 TavilyExtract\n(Extract content)"]
        M4 --> M5["📄 Documents"]
    end

    style OneCall fill:#10b981,color:#fff
    style TwoStep fill:#4a9eff,color:#fff

| Approach | Control | Complexity | Best For |
|---|---|---|---|
| TavilyCrawl | Low: Tavily decides everything | Simple: one call | Most use cases |
| Map + Extract | High: you control URL selection, batching, filtering | More code | Custom filtering, specific page selection |

Step 1: TavilyMap — Discover URLs

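from langchain_tavily import TavilyMap

# Assumes the langchain-tavily package and a TAVILY_API_KEY environment variable.
map_client = TavilyMap()

# Map the site: returns discovered URLs only, no page content.
sitemap = map_client.invoke({
    "url": "https://python.langchain.com",
})

urls = sitemap["results"]
print(f"Discovered {len(urls)} URLs")
# → Discovered 500 URLs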

TavilyMap returns a list of the URLs it can discover on the site. It extracts no content; it only builds the sitemap.

What You Can Do With the URL List

  • Filter by pattern: only URLs containing /agents/, /tutorials/, etc.
  • Exclude certain sections: skip /blog/, /changelog/, etc.
  • Prioritize: process the most important pages first
  • Batch: group URLs for concurrent extraction
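
The filtering and prioritizing ideas above can be sketched in plain Python. This is a minimal illustration, not part of the Tavily API: the helper names `filter_urls` and `prioritize` and the `example.com` URLs are hypothetical, and the sample list stands in for `sitemap["results"]`.

```python
from urllib.parse import urlparse


def filter_urls(urls, include=(), exclude=()):
    """Keep URLs whose path contains any `include` fragment and no `exclude` fragment.

    An empty `include` keeps everything that is not excluded.
    """
    kept = []
    for url in urls:
        path = urlparse(url).path
        if include and not any(fragment in path for fragment in include):
            continue
        if any(fragment in path for fragment in exclude):
            continue
        kept.append(url)
    return kept


def prioritize(urls, priority_fragments):
    """Stable sort: URLs matching an earlier fragment come first."""
    def rank(url):
        path = urlparse(url).path
        for position, fragment in enumerate(priority_fragments):
            if fragment in path:
                return position
        return len(priority_fragments)
    return sorted(urls, key=rank)


# Hypothetical sample standing in for sitemap["results"].
sample_urls = [
    "https://example.com/blog/release-notes",
    "https://example.com/docs/tutorials/rag",
    "https://example.com/docs/agents/overview",
    "https://example.com/changelog/2024",
    "https://example.com/docs/tutorials/chatbot",
]

selected = filter_urls(
    sample_urls,
    include=("/agents/", "/tutorials/"),
    exclude=("/blog/", "/changelog/"),
)
ordered = prioritize(selected, priority_fragments=("/agents/", "/tutorials/"))
print(ordered)
```

Matching on the parsed path rather than the full URL avoids accidental hits in the domain or query string. Substring matching is deliberately simple; a regular expression may suit stricter rules.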

Step 2: TavilyExtract — Extract Content

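from langchain_tavily import TavilyExtract

# Assumes the langchain-tavily package and a TAVILY_API_KEY environment variable.
extract = TavilyExtract()

# Extract content from specific URLs
result = extract.invoke({
    "urls": ["https://python.langchain.com/docs/get-started", ...]
})

# Each result has: {"url": "...", "raw_content": "..."}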

TavilyExtract accepts a list of URLs and returns the extracted content for each. It supports batch processing — you can send multiple URLs in a single API call.
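
Once the content is extracted, it usually needs to be reshaped into documents. The sketch below assumes the response wraps its entries in a `"results"` list (mirroring the map response) with each entry shaped as described above. The `to_documents` helper and the fake response are hypothetical and make no network call.

```python
def to_documents(response):
    """Turn an extract-style response into simple document dicts.

    Assumes response["results"] is a list of {"url", "raw_content"} dicts.
    Entries with empty content are skipped.
    """
    documents = []
    for item in response.get("results", []):
        content = item.get("raw_content") or ""
        if not content.strip():
            continue
        documents.append(
            {"page_content": content, "metadata": {"source": item["url"]}}
        )
    return documents


# Hypothetical response, shaped like the result comment above.
fake_response = {
    "results": [
        {"url": "https://example.com/a", "raw_content": "Page A text"},
        {"url": "https://example.com/b", "raw_content": ""},
    ]
}
docs = to_documents(fake_response)
print(docs)
```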

Batch Processing Strategy

def chunk_urls(urls: list, chunk_size: int = 20) -> list:
    """Split URL list into batches."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]

batches = chunk_urls(urls, chunk_size=20)
# 500 URLs → 25 batches of 20 URLs each

Note: Don't make the batch size too large. The API limits how many URLs it can process per request; 20 is a safe starting point.
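
One way to run the batches concurrently is a small thread pool around whatever function performs the extraction. This is a sketch under assumptions: `extract_all` is a hypothetical helper, `extract_fn` is any callable that takes a list of URLs and returns a list of result dicts, and `fake_extract` stands in for a real extract call so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_urls(urls, chunk_size=20):
    """Split URL list into batches (same helper as above)."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


def extract_all(urls, extract_fn, chunk_size=20, max_workers=4):
    """Run `extract_fn` on each batch concurrently.

    Returns (results, failed_batches). A batch whose call raises is kept
    in failed_batches so it can be retried later.
    """
    batches = chunk_urls(urls, chunk_size)
    results, failed_batches = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_fn, batch) for batch in batches]
        # Iterate in submission order so results keep the input order.
        for batch, future in zip(batches, futures):
            try:
                results.extend(future.result())
            except Exception:
                failed_batches.append(batch)
    return results, failed_batches


# Stand-in for a real extract call (hypothetical; no network access).
def fake_extract(batch):
    return [{"url": url, "raw_content": f"content of {url}"} for url in batch]


urls = [f"https://example.com/page-{n}" for n in range(45)]
results, failed = extract_all(urls, fake_extract)
print(len(results), len(failed))
```

Keeping `max_workers` small is a cautious default, since the service may also rate-limit concurrent requests.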


When to Use Map + Extract

| Scenario | Use TavilyCrawl | Use Map + Extract |
|---|---|---|
| "Give me everything from this site" | ✅ | ❌ Overkill |
| "I need only the /tutorials/ section" | ⚠️ Use instructions | ✅ Filter URLs after mapping |
| "I want to process URLs in custom order" | ❌ Can't control | ✅ Full control |
| "I need to resume after a failure" | ❌ Start over | ✅ Skip already-processed URLs |
| "I want to add custom metadata per section" | ❌ No control | ✅ Add metadata based on URL path |

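Two of the scenarios above, resuming after a failure and adding per-section metadata, come down to a few lines of bookkeeping. The following is a minimal sketch: `pending_urls` and `section_metadata` are hypothetical helpers, the URLs are made up, and treating the first path segment as the section is an assumption about the site's layout.

```python
from urllib.parse import urlparse


def pending_urls(all_urls, processed):
    """Return URLs not yet processed, preserving the original order."""
    done = set(processed)
    return [url for url in all_urls if url not in done]


def section_metadata(url):
    """Derive a `section` label from the first path segment of the URL."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return {"source": url, "section": segments[0] if segments else "root"}


all_urls = [
    "https://example.com/tutorials/rag",
    "https://example.com/agents/overview",
    "https://example.com/",
]
# Hypothetical checkpoint: URLs finished before the failure.
processed = {"https://example.com/tutorials/rag"}

todo = pending_urls(all_urls, processed)
print(todo)
print([section_metadata(url) for url in todo])
```

In practice the `processed` set would be persisted (for example, to a file) after each successful batch so a restart can pick up where it stopped.
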
Summary

| Component | Purpose | Key Insight |
|---|---|---|
| TavilyMap | Discover all URLs on a site | Returns URLs only; no content extraction |
| TavilyExtract | Extract content from specific URLs | Accepts URL lists and supports batch processing |
| When to combine | Need custom filtering, batching, or URL-level control | More code, but more flexibility |
| Default recommendation | Use TavilyCrawl unless you specifically need more control | Simpler is better for most use cases |