07.07 - TavilyMap & TavilyExtract: Manual Two-Step Crawling
Overview
While TavilyCrawl (Lesson 06) handles everything in one call, sometimes you need more granular control over the crawling process. TavilyMap and TavilyExtract split crawling into two explicit steps: first discover all URLs (map), then extract content from selected URLs (extract). This lesson covers when and why you'd use this approach.
TavilyCrawl vs TavilyMap + TavilyExtract
flowchart TD
subgraph OneCall["TavilyCrawl (Lesson 06)"]
C1["🌐 URL"] --> C2["🔍 Auto map\n+ filter + extract"] --> C3["📄 Documents"]
end
subgraph TwoStep["TavilyMap + TavilyExtract (This Lesson)"]
M1["🌐 URL"] --> M2["🗺️ TavilyMap\n(Discover URLs)"]
M2 --> M3["🔧 Custom filtering\n+ batching logic"]
M3 --> M4["📥 TavilyExtract\n(Extract content)"]
M4 --> M5["📄 Documents"]
end
style OneCall fill:#10b981,color:#fff
style TwoStep fill:#4a9eff,color:#fff
| Approach | Control | Complexity | Best For |
|---|---|---|---|
| TavilyCrawl | Low — Tavily decides everything | Simple — one call | Most use cases |
| Map + Extract | High — you control URL selection, batching, filtering | More code | Custom filtering, specific page selection |
Step 1: TavilyMap: Discover URLs
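from langchain_tavily import TavilyMap  # assumes the langchain-tavily package and TAVILY_API_KEY set

map_client = TavilyMap()
sitemap = map_client.invoke({
    "url": "https://python.langchain.com",
})
urls = sitemap["results"]  # list of discovered URLs, no page content
print(f"Discovered {len(urls)} URLs")
# → e.g. Discovered 500 URLs (the count depends on the site)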
TavilyMap returns a list of the URLs it discovers on the site. It doesn't extract any content; it only builds the sitemap.
What You Can Do With the URL List
- Filter by pattern: only URLs containing /agents/, /tutorials/, etc.
- Exclude certain sections: skip /blog/, /changelog/, etc.
- Prioritize: process the most important pages first
- Batch: group URLs for concurrent extraction
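The filtering, exclusion, and prioritization steps above can be sketched with the standard library alone. This is a minimal illustration, not part of the Tavily API: the helper name `select_urls`, its parameters, and the `example.com` URLs are all hypothetical, and matching is done by simple substring checks on the URL path.

```python
from urllib.parse import urlparse


def select_urls(urls, include=(), exclude=(), priority=()):
    """Filter URLs by path substring, then order them by priority.

    include:  keep a URL only if its path contains one of these (empty = keep all)
    exclude:  drop a URL if its path contains any of these
    priority: URLs matching earlier entries are processed first
    """
    selected = []
    for url in urls:
        path = urlparse(url).path
        if include and not any(part in path for part in include):
            continue
        if any(part in path for part in exclude):
            continue
        selected.append(url)

    def rank(url):
        path = urlparse(url).path
        for position, part in enumerate(priority):
            if part in path:
                return position
        return len(priority)  # unmatched URLs go last

    # sorted() is stable, so URLs with equal rank keep their original order.
    return sorted(selected, key=rank)


# Illustrative input only; real values would come from the map step.
example_urls = [
    "https://example.com/blog/release-notes",
    "https://example.com/docs/tutorials/intro",
    "https://example.com/docs/agents/overview",
    "https://example.com/changelog/2024",
    "https://example.com/docs/tutorials/advanced",
]

picked = select_urls(
    example_urls,
    include=("/agents/", "/tutorials/"),
    exclude=("/blog/", "/changelog/"),
    priority=("/agents/",),
)
print(picked)
```

Matching on the parsed path rather than the full URL avoids accidental hits on the domain or query string. Substring matching is the simplest choice here; a regular expression or prefix check may suit stricter rules.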
Step 2: TavilyExtract: Extract Content
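from langchain_tavily import TavilyExtract  # assumes the langchain-tavily package

extract = TavilyExtract()

# Extract content from specific URLs
result = extract.invoke({
    "urls": ["https://python.langchain.com/docs/get-started"],  # ... plus more URLs
})
# Each result has: {"url": "...", "raw_content": "..."}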
TavilyExtract accepts a list of URLs and returns the extracted content for each. It supports batch processing — you can send multiple URLs in a single API call.
Batch Processing Strategy
def chunk_urls(urls: list, chunk_size: int = 20) -> list:
    """Split URL list into batches."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]

batches = chunk_urls(urls, chunk_size=20)
# 500 URLs → 25 batches of 20 URLs each
[!NOTE] Don't make the batch size too large: the API limits how many URLs it can process per request. A batch size of 20 is a safe starting point.
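Once the URLs are split into batches, the batches can be extracted concurrently. The sketch below is one possible arrangement, assuming a caller-supplied `extract_batch` function (hypothetical) that takes a list of URLs and returns a list of `{"url": ..., "raw_content": ...}` dicts. With a real client it might wrap a call such as `extract.invoke({"urls": batch})`, but the exact response handling is left as an assumption. A fake extractor stands in for the API so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_urls(urls, chunk_size=20):
    """Split URL list into batches."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


def extract_all(urls, extract_batch, chunk_size=20, max_workers=4):
    """Run extract_batch over every batch concurrently.

    Returns (documents, failed_urls). A failed batch is recorded, not raised,
    so the remaining batches still complete.
    """
    documents, failed = [], []
    batches = chunk_urls(urls, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_batch, batch) for batch in batches]
        # Collect results in submission order so output order is deterministic.
        for batch, future in zip(batches, futures):
            try:
                documents.extend(future.result())
            except Exception:
                failed.extend(batch)  # keep the URLs for a later retry
    return documents, failed


def fake_extract_batch(batch):
    """Stand-in for a real API call (hypothetical); fails on 'broken' URLs."""
    if any("broken" in url for url in batch):
        raise RuntimeError("simulated API failure")
    return [{"url": url, "raw_content": f"content of {url}"} for url in batch]


demo_urls = [f"https://example.com/page-{i}" for i in range(45)]
demo_urls.append("https://example.com/broken")

docs, failed_urls = extract_all(demo_urls, fake_extract_batch, chunk_size=20)
print(len(docs), len(failed_urls))
```

Threads fit this workload because each batch spends most of its time waiting on the network. Keeping `max_workers` small is a cautious default, since the API's rate limits are not stated here.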
When to Use Map + Extract
| Scenario | Use TavilyCrawl | Use Map + Extract |
|---|---|---|
| "Give me everything from this site" | ✅ | ❌ Overkill |
| "I need only the /tutorials/ section" | ⚠️ Use instructions | ✅ Filter URLs after mapping |
| "I want to process URLs in custom order" | ❌ Can't control | ✅ Full control |
| "I need to resume after a failure" | ❌ Start over | ✅ Skip already-processed URLs |
| "I want to add custom metadata per section" | ❌ No control | ✅ Add metadata based on URL path |
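Two rows in the table, resuming after a failure and adding metadata per section, come down to a small amount of bookkeeping around the URL list. The following is a rough sketch under stated assumptions: the JSON checkpoint file, the `process` and `handle` names, and the `section` metadata key are all hypothetical, and `handle` stands in for whatever extraction and storage step you run per URL.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def load_done(checkpoint: Path) -> set:
    """Read the set of URLs already processed, or an empty set on first run."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def section_metadata(url: str) -> dict:
    """Derive metadata from the URL path; the 'section' key is illustrative."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return {"source": url, "section": parts[0] if parts else "root"}


def process(urls, checkpoint: Path, handle):
    """Call handle(url, metadata) for each URL not yet in the checkpoint."""
    done = load_done(checkpoint)
    for url in urls:
        if url in done:
            continue  # finished in an earlier run
        handle(url, section_metadata(url))
        done.add(url)
        # Persist after every URL so a crash loses at most the one in flight.
        checkpoint.write_text(json.dumps(sorted(done)))
    return done


site_urls = [
    "https://example.com/tutorials/a",
    "https://example.com/agents/b",
    "https://example.com/tutorials/c",
]
handled = []


def record(url, meta):
    handled.append((url, meta["section"]))


with tempfile.TemporaryDirectory() as tmp:
    checkpoint_file = Path(tmp) / "done.json"
    process(site_urls[:2], checkpoint_file, record)  # first run stops early
    process(site_urls, checkpoint_file, record)      # second run resumes

print(handled)
```

Writing the checkpoint after every URL trades some I/O for simplicity. For large sites, writing once per batch would be a reasonable alternative.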
Summary
| Component | Purpose | Key Insight |
|---|---|---|
| TavilyMap | Discover all URLs on a site | Returns URLs only — no content extraction |
| TavilyExtract | Extract content from specific URLs | Accepts URL lists — supports batch processing |
| When to combine | Need custom filtering, batching, or URL-level control | More code, but more flexibility |
| Default recommendation | Use TavilyCrawl unless you specifically need more control | Simpler is better for most use cases |