07.11 - Batch Indexing: Concurrent Vector Store Ingestion
Overview
With ~6,500 document chunks ready, we need to embed each chunk and store it in the vector store. This lesson implements concurrent batch indexing — the same asyncio.gather pattern used for crawling, but now applied to embedding and storing vectors. We also explore rate limiting in practice and demonstrate switching between Pinecone and ChromaDB.
The Indexing Architecture
flowchart TD
    CHUNKS["✂️ 6,500 Chunks"]
    BATCH["📦 Batch into groups\n(batch_size=500)"]
    B1["Batch 1: 500 chunks"]
    B2["Batch 2: 500 chunks"]
    BN["Batch 13: 500 chunks"]
    subgraph PerBatch["Per Batch (concurrent)"]
        EMBED["🔢 Embed via OpenAI"]
        STORE["🗄️ Upsert to Pinecone"]
        EMBED --> STORE
    end
    CHUNKS --> BATCH
    BATCH --> B1 & B2 & BN
    B1 & B2 & BN --> PerBatch
    style CHUNKS fill:#1e3a5f,color:#fff
    style PerBatch fill:#10b981,color:#fff
Implementation
The Batch Indexing Function
import asyncio
from typing import List

from langchain_core.documents import Document

# `vectorstore` and the log_* helpers are assumed to be defined at module
# level, as in the earlier lessons of this pipeline.


async def index_documents_async(documents: List[Document], batch_size: int):
    """Embed and index documents into the vector store in concurrent batches."""
    log_header("Vector Storage Phase")
    log_info(f"Indexing {len(documents)} documents")

    # Split into batches
    batches = [
        documents[i:i + batch_size]
        for i in range(0, len(documents), batch_size)
    ]
    log_info(f"Created {len(batches)} batches")

    async def add_batch(batch: List[Document], batch_number: int) -> bool:
        """Add a single batch to the vector store."""
        try:
            await vectorstore.aadd_documents(batch)
            log_success(f"Batch {batch_number}: indexed {len(batch)} docs")
            return True
        except Exception as e:
            # Catch here so one failed batch does not cancel the others
            log_error(f"Batch {batch_number} failed: {e}")
            return False

    # Fire all batches concurrently
    tasks = [
        add_batch(batch, num)
        for num, batch in enumerate(batches)
    ]
    results = await asyncio.gather(*tasks)

    # Report results (True counts as 1, False as 0)
    successful = sum(results)
    if successful == len(batches):
        log_success(f"All {len(batches)} batches indexed successfully!")
    else:
        log_warning(f"{successful}/{len(batches)} batches succeeded")
Key Design Patterns
| Pattern | Implementation | Why |
|---|---|---|
| Batching | Split 6,500 docs into groups of 500 | Avoid overwhelming the embedding API |
| Async execution | `vectorstore.aadd_documents()` | Non-blocking I/O for API calls |
| Concurrent processing | `asyncio.gather(*tasks)` | All batches run simultaneously |
| Boolean tracking | Each batch returns True/False | Count successes vs failures |
| Batch numbering | `enumerate(batches)` | Identify failed batches in logs |
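The patterns in the table can be tried without API keys. The following is a minimal sketch, not the lesson's code: `FakeVectorStore` and `index_in_batches` are hypothetical names, documents are plain strings, and failures are simulated. It returns the failed batch numbers instead of booleans, which is one way to make a later retry of only those batches possible.

```python
import asyncio
from typing import List, Optional, Set


class FakeVectorStore:
    """Hypothetical stand-in for a real vector store; fails on chosen chunks."""

    def __init__(self, fail_on: Set[str]) -> None:
        self.fail_on = fail_on
        self.stored: List[str] = []

    async def aadd_documents(self, batch: List[str]) -> None:
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        if self.fail_on & set(batch):
            raise RuntimeError("429 Rate limit exceeded (simulated)")
        self.stored.extend(batch)


async def index_in_batches(
    store: FakeVectorStore, docs: List[str], batch_size: int
) -> List[int]:
    """Index all batches concurrently and return the numbers of failed batches."""
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

    async def add_batch(batch: List[str], number: int) -> Optional[int]:
        try:
            await store.aadd_documents(batch)
            return None  # success
        except Exception:
            return number  # report which batch failed

    results = await asyncio.gather(
        *(add_batch(batch, number) for number, batch in enumerate(batches))
    )
    return [number for number in results if number is not None]


docs = [f"chunk-{i}" for i in range(6500)]
store = FakeVectorStore(fail_on={"chunk-1700", "chunk-3900"})
failed = asyncio.run(index_in_batches(store, docs, batch_size=500))
print(failed)  # batch numbers start at 0, as with enumerate(batches)
```

Because each failure is caught inside `add_batch`, one failing batch does not cancel the other twelve.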
Calling from Main
async def main():
    # ... crawling and chunking from previous lessons ...

    # Index into vector store
    await index_documents_async(
        documents=split_docs,
        batch_size=500
    )

    # Summary stats
    log_header("Pipeline Complete")
    log_info(f"URLs crawled: {len(sitemap['results'])}")
    log_info(f"Documents chunked: {len(split_docs)}")
    log_success("Ingestion pipeline finished!")
Understanding the Rate Limiting Chain
When aadd_documents runs, it triggers a chain of API calls that can each be rate-limited:
sequenceDiagram
    participant App
    participant OAI as OpenAI Embeddings API
    participant PC as Pinecone
    loop Per batch (500 docs)
        App->>OAI: Embed 500 chunks (in sub-batches of 50)
        Note over OAI: ⚠️ Rate limit check<br/>(tokens per minute)
        OAI-->>App: 500 vectors
        App->>PC: Upsert 500 vectors
        Note over PC: ⚠️ Rate limit check<br/>(writes per second)
        PC-->>App: ✅ Success
    end
Where Rate Limits Hit
| Component | Limit Type | What Triggers It |
|---|---|---|
| OpenAI Embeddings | Tokens per minute (TPM) | Embedding too many chunks too fast |
| Pinecone | Writes per second | Upserting too many vectors too fast |
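Since every batch starts at once, all of them reach these limits together. One common way to reduce that pressure, which the lesson's code does not use, is to cap the number of batches in flight with `asyncio.Semaphore`. The sketch below assumes hypothetical names (`run_bounded`, `max_concurrency`, `fake_upload`) and replaces the embed and upsert round trip with a short sleep.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def run_bounded(
    jobs: List[Callable[[], Awaitable[bool]]], max_concurrency: int
) -> List[bool]:
    """Run every job, letting at most max_concurrency run at the same time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job: Callable[[], Awaitable[bool]]) -> bool:
        async with semaphore:  # waits here while the cap is reached
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo() -> Tuple[List[bool], int]:
    in_flight = 0
    peak = 0

    async def fake_upload() -> bool:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)  # track the highest concurrency seen
        await asyncio.sleep(0.01)  # stands in for the embed + upsert round trip
        in_flight -= 1
        return True

    results = await run_bounded([fake_upload] * 13, max_concurrency=3)
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), peak)
```

The trade-off is throughput: a lower cap means fewer 429 responses but a longer total run.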
The retry_min_seconds Effect
Without retry_min_seconds=10 on the embeddings model:
❌ Batch 3 failed: 429 Rate limit exceeded
❌ Batch 7 failed: 429 Rate limit exceeded
❌ Batch 11 failed: 429 Rate limit exceeded
⚠️ 10/13 batches succeeded
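With retry_min_seconds=10, each retry waits at least 10 seconds, which is usually long enough for the rate limit window to reset, so a batch that hits a 429 can succeed on a later attempt instead of failing outright.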
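To make the effect of a minimum wait concrete, here is a small simulation of retrying with a wait that starts at a floor and doubles up to a ceiling. It is a sketch of the general idea, not LangChain's or OpenAI's actual retry code; `RateLimitError`, `call_with_retry`, and `flaky_embed` are hypothetical, and sleeps are recorded rather than performed.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


class RateLimitError(Exception):
    """Hypothetical stand-in for a 429 response from an embeddings API."""


async def call_with_retry(
    call: Callable[[], Awaitable[str]],
    min_wait: float,
    max_wait: float,
    max_attempts: int,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    """Retry on RateLimitError, doubling the wait from min_wait up to max_wait."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    wait = min_wait
    attempt = 1
    while True:
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            await sleep(wait)
            wait = min(wait * 2, max_wait)
            attempt += 1


async def demo() -> Tuple[str, List[float]]:
    waits: List[float] = []
    calls = 0

    async def flaky_embed() -> str:
        nonlocal calls
        calls += 1
        if calls < 3:  # the first two calls are rate limited
            raise RateLimitError("429 Rate limit exceeded (simulated)")
        return "vectors"

    async def record_sleep(seconds: float) -> None:
        waits.append(seconds)  # record the wait instead of really sleeping

    result = await call_with_retry(
        flaky_embed, min_wait=10, max_wait=20, max_attempts=6, sleep=record_sleep
    )
    return result, waits


result, waits = asyncio.run(demo())
print(result, waits)
```

With a floor of 10 seconds the first retry already waits 10 seconds; a smaller floor would retry sooner and be more likely to hit the same rate limit window again.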
Batch Size Tradeoffs
| Batch Size | Effect | Recommendation |
|---|---|---|
| Too large (5000) | Hits embedding rate limits; single failure loses many docs | ❌ |
| Too small (10) | Too many API calls; slow overall | ❌ |
| 500 | Good balance — manageable batches, tolerable failure granularity | ✅ |
[!TIP] The optimal batch size depends on your embedding API tier (higher tiers have higher rate limits) and your vector store's write capacity. Start with 500 and adjust based on error logs.
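The arithmetic behind these trade-offs can be checked with a short calculation. This sketch assumes 6,500 chunks and embedding sub-batches of 50, as in the sequence diagram earlier; `batch_plan` is a hypothetical helper, and it counts requests only, not tokens or latency.

```python
import math
from typing import Dict


def batch_plan(total_chunks: int, batch_size: int, embed_sub_batch: int) -> Dict[str, int]:
    """Count batches and embedding requests for a given batch size."""
    batches = math.ceil(total_chunks / batch_size)
    embed_requests = 0
    for start in range(0, total_chunks, batch_size):
        size = min(batch_size, total_chunks - start)  # the last batch may be smaller
        embed_requests += math.ceil(size / embed_sub_batch)
    return {
        "batches": batches,
        "embed_requests": embed_requests,
        "max_docs_per_failed_batch": min(batch_size, total_chunks),
    }


for size in (5000, 500, 10):
    print(size, batch_plan(6500, size, embed_sub_batch=50))
```

Under these assumptions a batch size of 10 sends 650 embedding requests instead of 130, while a batch size of 5000 keeps the request count at 130 but puts up to 5,000 documents at risk in a single failure.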
Switching to ChromaDB
Swapping from Pinecone to ChromaDB is a one-line change thanks to LangChain's uniform interface:
from langchain_chroma import Chroma
from langchain_pinecone import PineconeVectorStore

# Pinecone (cloud)
vectorstore = PineconeVectorStore(
    index_name="langchain-docs-2025", embedding=embeddings
)

# ChromaDB (local): swap in this one statement instead
vectorstore = Chroma(
    persist_directory="./chroma_db", embedding_function=embeddings
)
Everything else (aadd_documents(), as_retriever(), similarity_search()) is called the same way.
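If you switch backends often, the choice can live in one place. The following is a sketch under assumptions: `make_vectorstore` is a hypothetical helper, the constructor arguments are the ones shown above, and the imports assume the `langchain-pinecone` and `langchain-chroma` packages are installed. Imports are deferred so that only the selected backend's package is needed.

```python
from typing import Any


def make_vectorstore(backend: str, embeddings: Any) -> Any:
    """Build the vector store named by backend: "pinecone" or "chroma"."""
    if backend == "pinecone":
        # Imported lazily so Chroma-only setups do not need this package
        from langchain_pinecone import PineconeVectorStore

        return PineconeVectorStore(
            index_name="langchain-docs-2025", embedding=embeddings
        )
    if backend == "chroma":
        # Imported lazily so Pinecone-only setups do not need this package
        from langchain_chroma import Chroma

        return Chroma(
            persist_directory="./chroma_db", embedding_function=embeddings
        )
    raise ValueError(f"Unknown vector store backend: {backend!r}")
```

The rest of the pipeline would then call `make_vectorstore("chroma", embeddings)` or `make_vectorstore("pinecone", embeddings)` and use the result unchanged.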
ChromaDB Storage
ChromaDB uses SQLite under the hood and persists to the ./chroma_db/ directory.
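To see what was written after a run, you can list the files under the persist directory. This is a generic sketch using only the standard library; `list_persisted_files` is a hypothetical helper, and it makes no assumption about which files ChromaDB creates.

```python
from pathlib import Path
from typing import List


def list_persisted_files(persist_directory: str) -> List[str]:
    """Return the relative path of every file under the directory, sorted."""
    root = Path(persist_directory)
    if not root.is_dir():
        return []  # nothing has been persisted yet
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


print(list_persisted_files("./chroma_db"))
```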
One-Way Embeddings: A Critical Concept
[!IMPORTANT] The embedding function works in one direction only: the model maps text to a vector and offers no inverse, so a vector store cannot decode a vector back into the original text. This is why vector stores always save the original text alongside the vector. When you retrieve a result, you get the text from the metadata, not by "decoding" the vector.
flowchart LR
    TEXT["'Pinecone is a vector database'"]
    EMBED["🔢 Embedding Model"]
    VEC["[0.12, 0.85, -0.34, ...]"]
    TEXT -->|"embed()"| VEC
    VEC -->|"❌ No inverse function"| TEXT
    style VEC fill:#ef4444,color:#fff
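The toy store below illustrates why the text has to be saved next to the vector. It is a sketch with hypothetical names: `toy_embed` is a hash-based stand-in that captures no meaning at all (unlike a real embedding model), so the example only demonstrates that lookup returns the stored text rather than decoding the vector.

```python
import hashlib
import math
from typing import List, Tuple


def toy_embed(text: str, dims: int = 8) -> List[float]:
    """Hypothetical stand-in for an embedding model: hash bytes scaled to [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


class ToyVectorStore:
    """Keeps (vector, original text) pairs, as real vector stores do."""

    def __init__(self) -> None:
        self.records: List[Tuple[List[float], str]] = []

    def add(self, text: str) -> None:
        self.records.append((toy_embed(text), text))

    def nearest_text(self, query: str) -> str:
        query_vector = toy_embed(query)
        # Find the closest stored vector, then return the text saved with it
        _, text = min(self.records, key=lambda r: math.dist(query_vector, r[0]))
        return text


store = ToyVectorStore()
store.add("Pinecone is a vector database")
store.add("ChromaDB persists to a local directory")
print(store.nearest_text("Pinecone is a vector database"))
```

If `add` stored only the vector, `nearest_text` would have nothing to return except a list of floats.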
Verifying in Pinecone
After running the pipeline, check the Pinecone dashboard:
- Record count should match the number of chunks (~6,500)
- Click any record to inspect:
    - Vector values: the embedding (1536 floats)
    - text: original chunk content
    - source: URL of the documentation page
Summary
| Step | Code | Result |
|---|---|---|
| Batch | `docs[i:i+500]` | 13 batches of 500 chunks |
| Embed + Store | `await vectorstore.aadd_documents(batch)` | Concurrent embedding + upserting |
| Track results | `asyncio.gather(*tasks)` → count True/False | Know which batches succeeded |
| Handle rate limits | `retry_min_seconds=10` on embeddings | Auto-retry after rate limit errors |
| Verify | Pinecone dashboard | ~6,500 records with text + source metadata |