06.04 - Medium Analyzer: LangChain Class Review
Overview
Before writing the ingestion code, this lesson dives into the LangChain classes we'll use — examining their source code, understanding their abstractions, and learning why they make RAG pipelines portable across data sources, embedding models, and vector stores.
The Four Classes
flowchart LR
TL["📄 TextLoader\n(Document Loader)"]
CTS["✂️ CharacterTextSplitter\n(Text Splitter)"]
OE["🔢 OpenAIEmbeddings\n(Embedding Model)"]
PVS["🗄️ PineconeVectorStore\n(Vector Database)"]
TL --> CTS --> OE --> PVS
style TL fill:#4a9eff,color:#fff
style CTS fill:#f59e0b,color:#fff
style OE fill:#8b5cf6,color:#fff
style PVS fill:#10b981,color:#fff
1. TextLoader: Document Loaders
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./mediumblog.txt")
documents = loader.load()
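What It Does Under the Hood
The TextLoader source code is surprisingly simple:
# Simplified from LangChain source
from typing import List

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class TextLoader(BaseLoader):
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self) -> List[Document]:
        # Read the whole file and wrap it in a single Document
        with open(self.file_path) as f:
            text = f.read()
        return [Document(
            page_content=text,
            metadata={"source": self.file_path}
        )]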
That's it — open a file, read the text, wrap it in a Document object with metadata. The value isn't in the complexity of TextLoader itself — it's in the uniform interface it shares with every other loader.
The Abstraction's Power
Every document loader, regardless of source, implements the same .load() method and returns Document objects:
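# All of these return List[Document] with the same interface:
from langchain_community.document_loaders import (
    NotionDirectoryLoader, PyPDFLoader, TextLoader, WhatsAppChatLoader, YoutubeLoader,
)

TextLoader("./file.txt").load()
PyPDFLoader("./report.pdf").load()
WhatsAppChatLoader("./chat.txt").load()
NotionDirectoryLoader("./notion-export").load()
YoutubeLoader.from_youtube_url("https://...").load()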
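To make the shared interface concrete, here is a minimal stdlib-only sketch. It does not use LangChain: `Document`, `Loader`, `WholeFileLoader`, `LineLoader`, and `total_characters` are hypothetical stand-ins written for this illustration. The point is only that downstream code can depend on `.load()` and ignore which loader it received.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Protocol


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Loader(Protocol):
    """Anything with a load() that returns Documents counts as a loader."""
    def load(self) -> List[Document]: ...


class WholeFileLoader:
    """Hypothetical loader: one Document for the entire file."""
    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> List[Document]:
        text = Path(self.path).read_text(encoding="utf-8")
        return [Document(text, {"source": self.path})]


class LineLoader:
    """Hypothetical loader: one Document per non-empty line."""
    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> List[Document]:
        lines = Path(self.path).read_text(encoding="utf-8").splitlines()
        return [Document(line, {"source": self.path}) for line in lines if line.strip()]


def total_characters(loader: Loader) -> int:
    """Downstream code depends only on load(), not on the loader type."""
    return sum(len(doc.page_content) for doc in loader.load())


with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "sample.txt")
    Path(path).write_text("first line\nsecond line\n", encoding="utf-8")
    whole_docs = WholeFileLoader(path).load()
    line_docs = LineLoader(path).load()
    line_total = total_characters(LineLoader(path))

print(len(whole_docs), len(line_docs), line_total)
```

Swapping `LineLoader` for `WholeFileLoader` in the call to `total_characters` requires no other change, which is the same property the LangChain loaders above rely on.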
The Document Object
class Document:
    page_content: str  # The actual text content
    metadata: dict     # Source info, timestamps, page numbers, etc.
| Field | Purpose | Example |
|---|---|---|
| page_content | The text data the LLM will process | "Pinecone is a vector database..." |
| metadata | Source tracking for citations, filtering, and debugging | {"source": "./mediumblog.txt"} |
The metadata field is critical for production RAG systems:
- Citations — tell the user where the answer came from
- Filtering — search only within specific documents or categories
- Debugging — trace which chunk produced a particular answer
[!TIP] You can add custom metadata to any document: doc.metadata["category"] = "finance". This enables filtered retrieval, e.g., searching only within finance documents.
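As a rough illustration of metadata-based filtering, the sketch below tags one document and then selects by that tag. It uses a stdlib stand-in for `Document`, and `filter_by_metadata` is a hypothetical helper, not a LangChain function; real vector stores usually apply such filters on the server side at query time. The sample texts are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for LangChain's Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[Document], **criteria: Any) -> List[Document]:
    """Keep documents whose metadata matches every key/value in criteria."""
    return [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    Document("Q3 revenue grew.", {"source": "./report.txt"}),
    Document("Pinecone is a vector database.", {"source": "./mediumblog.txt"}),
]
docs[0].metadata["category"] = "finance"  # custom metadata, as in the tip

finance_docs = filter_by_metadata(docs, category="finance")
print([doc.metadata["source"] for doc in finance_docs])
```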
2. CharacterTextSplitter: Text Splitters
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    separator="\n\n"
)
chunks = text_splitter.split_documents(documents)
Key Parameters
| Parameter | Value | Meaning |
|---|---|---|
| chunk_size | 1000 | Maximum characters per chunk |
| chunk_overlap | 0 | No shared content between adjacent chunks |
| separator | "\n\n" | Split on double newlines (paragraph boundaries) |
| length_function | len (default) | How to measure chunk length (can be a custom token counter) |
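The way these parameters interact can be sketched in a few lines. This is a simplified stdlib approximation, not LangChain's implementation: `split_on_separator` is a hypothetical function that splits on the separator, then greedily packs pieces into chunks up to `chunk_size` as measured by `length_function`. It assumes zero overlap.

```python
from typing import Callable, List


def split_on_separator(
    text: str,
    chunk_size: int = 1000,
    separator: str = "\n\n",
    length_function: Callable[[str], int] = len,
) -> List[str]:
    """Greedily pack separator-delimited pieces into chunks (no overlap)."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Measured in characters (the default).
char_chunks = split_on_separator("aaaa\n\nbbbb\n\ncccccccccccc\n\ndd", chunk_size=10)

# Measured in whitespace-separated words via a custom length_function.
word_chunks = split_on_separator(
    "one two\n\nthree\n\nfour five six\n\nseven",
    chunk_size=3,
    length_function=lambda s: len(s.split()),
)
print(char_chunks)
print(word_chunks)
```

Note that the 12-character piece in the first call becomes a chunk of its own even though it is longer than `chunk_size=10`, because the sketch never splits inside a piece.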
Why 1000 Characters?
The chunk size is a heuristic — there's no magic number:
flowchart LR
SMALL["100 chars\n❌ Too granular\nLacks context"]
MED["500-1500 chars\n✅ Sweet spot\nMeaningful passages"]
LARGE["10K+ chars\n❌ Too broad\nIncludes irrelevant content"]
style SMALL fill:#ef4444,color:#fff
style MED fill:#10b981,color:#fff
style LARGE fill:#ef4444,color:#fff
Rule of thumb: A chunk should be small enough to focus on one topic, but large enough that a human reading it would understand what it's about.
[!IMPORTANT] "Garbage in, garbage out" applies — even with million-token context windows. Sending 3 focused, relevant chunks produces better answers than sending 100 unfocused chunks. Chunking quality directly impacts answer quality.
Why Chunks Might Exceed chunk_size
CharacterTextSplitter splits only on the separator. When splitting on "\n\n", a paragraph longer than chunk_size has no internal split point, so it is emitted whole as an oversized chunk. LangChain logs a warning of the form "Created a chunk of size N, which is longer than the specified 1000" when this happens.
This is expected behavior, not an error. The splitter prioritizes semantic boundaries over strict size limits.
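If you want to know in advance which paragraphs will trigger this, a small pre-check is enough. The sketch below is stdlib-only. `oversized_paragraphs` is a hypothetical helper, and its log message is this sketch's own wording, not LangChain's.

```python
import logging
from typing import List

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("chunk_check")


def oversized_paragraphs(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[int]:
    """Return the lengths of paragraphs that cannot fit in one chunk."""
    lengths = [len(p) for p in text.split(separator)]
    too_long = [n for n in lengths if n > chunk_size]
    for n in too_long:
        logger.warning("Paragraph of %d characters exceeds chunk_size=%d", n, chunk_size)
    return too_long


sample = "short paragraph" + "\n\n" + "x" * 1200 + "\n\n" + "another short one"
flagged = oversized_paragraphs(sample)
print(flagged)
```

If many paragraphs are flagged, a larger chunk_size or a splitter that falls back to finer separators may be a better fit for the data.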
3. OpenAIEmbeddings: Embedding Models
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
What It Does
The embedding object is a wrapper around the OpenAI embeddings API. When called, it sends text to the API and receives back a vector:
# Under the hood:
# POST https://api.openai.com/v1/embeddings
# Body: {"input": "Pinecone is a vector database...", "model": "text-embedding-3-small"}
# Response: {"data": [{"embedding": [0.12, 0.85, -0.34, ...]}]}
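The request and response shapes in those comments can be reproduced with the standard library alone. The sketch below builds the POST request without sending it and extracts vectors from a decoded response. `build_embedding_request` and `extract_vectors` are hypothetical helper names, the API key is a placeholder, and the sample response is truncated to the fields shown above.

```python
import json
import urllib.request
from typing import Any, Dict, List

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(text: str, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one input string."""
    body = json.dumps({"input": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_vectors(response: Dict[str, Any]) -> List[List[float]]:
    """Pull the embedding vectors out of a decoded JSON response."""
    return [item["embedding"] for item in response["data"]]


request = build_embedding_request(
    "Pinecone is a vector database...", "text-embedding-3-small", api_key="sk-placeholder"
)
sample_response = {"data": [{"embedding": [0.12, 0.85, -0.34]}]}  # truncated, illustrative
vectors = extract_vectors(sample_response)
print(request.get_method(), request.full_url, vectors)
```

In practice OpenAIEmbeddings handles batching, retries, and authentication for you, so this is only meant to show what travels over the wire.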
The Uniform Interface
Just like document loaders, LangChain provides a uniform interface for embeddings:
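# All of these implement the same interface:
from langchain_cohere import CohereEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings

OpenAIEmbeddings(model="text-embedding-3-small")
CohereEmbeddings(model="embed-english-v3.0")
HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")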
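Switching embedding providers means changing one line of code; the rest of the pipeline code stays the same. The vector index must still match the new model's dimension, and previously stored documents have to be re-embedded with the new model.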
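The interface in question boils down to two methods, embed_documents and embed_query. The sketch below mirrors that shape with a stdlib Protocol and a toy embedder. `HashEmbeddings` and `embed_chunks` are hypothetical names for this illustration, and the hash-based vectors carry no semantic meaning.

```python
import hashlib
from typing import List, Protocol


class Embeddings(Protocol):
    """The two methods the rest of the pipeline relies on."""
    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...
    def embed_query(self, text: str) -> List[float]: ...


class HashEmbeddings:
    """Toy, hypothetical embedder: derives a fixed-size vector from a SHA-256 digest."""
    def __init__(self, dimensions: int) -> None:
        self.dimensions = dimensions

    def embed_query(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.dimensions)]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(text) for text in texts]


def embed_chunks(embedder: Embeddings, chunks: List[str]) -> List[List[float]]:
    """Pipeline code: works with any object that satisfies Embeddings."""
    return embedder.embed_documents(chunks)


chunks = ["Pinecone is a vector database.", "LangChain wraps many providers."]
small = embed_chunks(HashEmbeddings(dimensions=8), chunks)
large = embed_chunks(HashEmbeddings(dimensions=16), chunks)  # the one-line swap
print(len(small[0]), len(large[0]))
```

The swap changes the vector length from 8 to 16, so anything downstream that was sized for the old vectors would need to be rebuilt.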
Embedding Model Comparison
| Model | Provider | Dimensions | Cost | Quality |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 512–1536 | Very cheap | Good |
| text-embedding-3-large | OpenAI | 256–3072 | Moderate | Better |
| text-embedding-ada-002 | OpenAI | 1536 | 98% cheaper than legacy | Legacy but still popular |
| embed-english-v3.0 | Cohere | 1024 | Competitive | Good |
| all-MiniLM-L6-v2 | HuggingFace | 384 | Free (local) | Decent for prototyping |
[!NOTE] When embedding large datasets (millions of chunks), cost matters significantly. text-embedding-3-small is the sweet spot: good quality at very low cost. For maximum quality, use text-embedding-3-large.
4. PineconeVectorStore: Vector Databases
import os

from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=os.environ["INDEX_NAME"]
)
What It Does
from_documents() performs the entire ingestion in one call:
1. Iterates through all document chunks
2. Embeds each chunk using the provided embedding model
3. Stores the vectors + metadata in the Pinecone index
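The three steps above can be sketched with a tiny in-memory store. This is a stdlib-only illustration, not how PineconeVectorStore is implemented. `InMemoryStore` is a hypothetical class, and `letter_counts` is a toy embedding (letter frequencies) chosen only so that the example runs without an API key.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Stand-in for LangChain's Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def letter_counts(text: str) -> List[float]:
    """Toy embedding (an assumption for this sketch): counts of a-z."""
    lowered = text.lower()
    return [float(lowered.count(chr(ord("a") + i))) for i in range(26)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Hypothetical stand-in for a vector store; keeps everything in a list."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self.embed = embed
        self.records: List[Tuple[List[float], Document]] = []

    @classmethod
    def from_documents(
        cls, documents: List[Document], embed: Callable[[str], List[float]]
    ) -> "InMemoryStore":
        store = cls(embed)
        for doc in documents:                    # 1. iterate through the chunks
            vector = embed(doc.page_content)     # 2. embed each chunk
            store.records.append((vector, doc))  # 3. store vector, text, and metadata
        return store

    def similarity_search(self, query: str, k: int = 1) -> List[Document]:
        """Return the k documents whose vectors are closest to the query."""
        query_vector = self.embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(query_vector, r[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


chunks = [
    Document("zzz zzz zzz", {"source": "./a.txt"}),
    Document("vector database", {"source": "./mediumblog.txt"}),
]
store = InMemoryStore.from_documents(chunks, letter_counts)
best = store.similarity_search("vector databases")[0]
print(best.metadata["source"])
```

Because the document travels with its vector, the search result carries its metadata back out, which is what makes citations possible later.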
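The Uniform Interface
# All vector stores share the same interface:
from langchain_chroma import Chroma
from langchain_community.vectorstores import FAISS
from langchain_pinecone import PineconeVectorStore

PineconeVectorStore.from_documents(docs, embeddings, index_name="...")
Chroma.from_documents(docs, embeddings, collection_name="...")
FAISS.from_documents(docs, embeddings)
Switching vector databases means changing one import and one class name, plus any store-specific arguments such as index_name or collection_name.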
What Gets Stored
For each chunk, Pinecone stores:
| Field | Example | Purpose |
|---|---|---|
| Vector | [0.12, 0.85, -0.34, ...] (1536 floats) | Similarity search |
| Text | "Pinecone is a managed vector database..." | Returned with search results |
| Source | "./mediumblog.txt" | Citation and traceability |
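One plausible shape for such a record is sketched below, assuming an id / values / metadata layout with the chunk text kept under a metadata key named "text". The exact keys and the ID scheme are assumptions for illustration, and `build_record` is a hypothetical helper.

```python
from typing import Any, Dict, List


def build_record(chunk_id: str, vector: List[float], text: str, source: str) -> Dict[str, Any]:
    """Bundle one chunk's vector with the fields needed at query time."""
    return {
        "id": chunk_id,
        "values": vector,                              # used for similarity search
        "metadata": {"text": text, "source": source},  # returned with search results
    }


record = build_record(
    chunk_id="mediumblog-0",     # hypothetical ID scheme
    vector=[0.12, 0.85, -0.34],  # truncated; a real vector here would have 1536 floats
    text="Pinecone is a managed vector database...",
    source="./mediumblog.txt",
)
print(sorted(record["metadata"]))
```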
Summary of Imports
# ingestion.py
from langchain_community.document_loaders import TextLoader # Load data
from langchain.text_splitter import CharacterTextSplitter # Split into chunks
from langchain_openai import OpenAIEmbeddings # Convert to vectors
from langchain_pinecone import PineconeVectorStore # Store vectors
| Class | Abstraction | Key Method |
|---|---|---|
| TextLoader | Data source → Document | .load() |
| CharacterTextSplitter | Document → Chunk[] | .split_documents() |
| OpenAIEmbeddings | Text → Vector | .embed_query() / .embed_documents() |
| PineconeVectorStore | Chunks → Stored Vectors | .from_documents() |