06.05 — Medium Analyzer: Ingestion Implementation¶
Overview¶
This lesson implements the complete ingestion pipeline — loading the Medium blog from a text file, splitting it into chunks, embedding each chunk, and storing the vectors in Pinecone. The entire pipeline runs in just a few lines of LangChain code, but understanding what each line does is critical for debugging and production usage.
The Ingestion Pipeline¶
flowchart LR
FILE["📄 mediumblog.txt"]
LOAD["1️⃣ TextLoader.load()"]
SPLIT["2️⃣ CharacterTextSplitter\n.split_documents()"]
EMBED["3️⃣ OpenAIEmbeddings"]
STORE["4️⃣ PineconeVectorStore\n.from_documents()"]
FILE --> LOAD --> SPLIT --> EMBED --> STORE
LOAD -->|"1 Document"| SPLIT
SPLIT -->|"20 Chunks"| EMBED
EMBED -->|"20 Vectors\n(1536-dim each)"| STORE
style FILE fill:#1e3a5f,color:#fff
style STORE fill:#10b981,color:#fff
Step 1: Load the Document¶
After loading:
- document is a list containing one Document object
- document[0].page_content contains the full article text
- document[0].metadata → {"source": "./mediumblog.txt"}
[!TIP] Encoding issues: Some operating systems may throw a
UnicodeDecodeError. Fix with:
Step 2: Split into Chunks¶
text_splitter = CharacterTextSplitter(
chunk_size=1000,
chunk_overlap=0
)
chunks = text_splitter.split_documents(document)
print(f"Created {len(chunks)} chunks") # → 20 chunks
What Happens Under the Hood¶
- The splitter walks through the text, looking for the separator (
"\n\n"by default) - It groups text between separators until close to
chunk_size - Each group becomes a new
Documentwith the same metadata as the original - The
page_contentis the chunk's text; themetadataretains the source
Examining the Chunks¶
# Each chunk is a Document:
chunks[0].page_content # First ~1000 chars of the article
chunks[0].metadata # {"source": "./mediumblog.txt"}
len(chunks) # 20 chunks for a typical blog post
Important observations:
- Chunks are still Document objects — same type as the input
- Each chunk inherits the metadata from the parent document
- Some chunks may exceed chunk_size if a paragraph is longer than the limit (LangChain warns about this)
The "Readable Chunk" Test¶
Rule of thumb: Read the content of each chunk. Does it make sense as a standalone passage? If a human can understand the chunk's topic, it's a good chunk size. If it's just a sentence fragment, the chunks are too small. If it covers 3 unrelated topics, they're too large.
Step 3: Initialize the Embedding Model¶
The embedding model is initialized but doesn't run yet — it's passed to the vector store, which calls it during ingestion.
[!NOTE] The default model is
text-embedding-ada-002. The newertext-embedding-3-smalloffers better quality per dollar. Either works — just ensure the Pinecone index dimensions match (1536 for both at full dimensionality).
Step 4: Embed and Store in Pinecone¶
print(f"Ingesting {len(chunks)} chunks...")
vectorstore = PineconeVectorStore.from_documents(
documents=chunks,
embedding=embeddings,
index_name=os.environ["INDEX_NAME"]
)
print("Ingestion complete!")
What from_documents() Does¶
sequenceDiagram
participant App as Application
participant OAI as OpenAI API
participant PC as Pinecone
loop For each chunk (batched)
App->>OAI: Embed chunk text
OAI-->>App: Vector [0.12, 0.85, ...]
end
App->>PC: Upsert all vectors + metadata
PC-->>App: Success (20 vectors stored)
- Iterates through all chunks
- Embeds each chunk by calling the OpenAI embeddings API (batched for efficiency)
- Upserts all vectors into the Pinecone index with metadata (text + source)
Built-In Production Features¶
LangChain's from_documents() includes:
- Batching — sends chunks in batches to avoid rate limits
- Threading — can use concurrent IO for faster ingestion
- Rate limit handling — retries with backoff when API limits are hit
- Async support — from_documents() has an async variant for non-blocking ingestion
Verifying in Pinecone Dashboard¶
After running the ingestion, go to the Pinecone dashboard:
- The index should show 20 vectors (or however many chunks were created)
- Click a vector to see its stored data:
- text — the chunk's page_content
- source — the file path from metadata
- vector values — the 1536-dimensional embedding
The Complete ingestion.py¶
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
load_dotenv()
if __name__ == "__main__":
print("Ingestion...")
# 1. Load the document
loader = TextLoader("./mediumblog.txt")
document = loader.load()
print(f"Loaded {len(document)} document(s)")
# 2. Split into chunks
text_splitter = CharacterTextSplitter(
chunk_size=1000,
chunk_overlap=0
)
chunks = text_splitter.split_documents(document)
print(f"Split into {len(chunks)} chunks")
# 3. Initialize embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# 4. Embed and store in Pinecone
print(f"Ingesting {len(chunks)} chunks into Pinecone...")
PineconeVectorStore.from_documents(
documents=chunks,
embedding=embeddings,
index_name=os.environ["INDEX_NAME"]
)
print("Ingestion complete!")
What We Just Built¶
flowchart TD
subgraph Before["Before Ingestion"]
EMPTY["🗄️ Empty Pinecone Index\n(0 vectors)"]
end
subgraph After["After Ingestion"]
FULL["🗄️ Populated Index\n(20 vectors, 1536-dim each)\nwith text + source metadata"]
end
Before -->|"python ingestion.py"| After
style Before fill:#ef4444,color:#fff
style After fill:#10b981,color:#fff
The ingestion pipeline is a one-time process. Once the vectors are stored, we can query them from the retrieval pipeline (next lessons) without re-running ingestion.
Summary¶
| Step | Code | Result |
|---|---|---|
| Load | TextLoader("./mediumblog.txt").load() |
1 Document object |
| Split | CharacterTextSplitter(1000, 0).split_documents(doc) |
20 Document chunks |
| Embed + Store | PineconeVectorStore.from_documents(chunks, embeddings, index) |
20 vectors in Pinecone |
| Concept | Key Insight |
|---|---|
| Encoding | Use encoding="utf-8" if you get UnicodeDecodeError |
| Chunk size | 1000 chars is a heuristic — adjust per document type |
| Oversized chunks | Expected when paragraphs exceed chunk_size |
| Batch ingestion | LangChain handles batching, threading, and rate limits |
| One-time | Ingestion runs once; retrieval queries the stored vectors |