06.02 - Introduction to RAG Implementation: Embeddings, Vector Stores & Similarity Search
Overview
Lesson 01 explained why RAG exists. This lesson explains how it works — the technical mechanisms that make "finding relevant chunks" possible. We cover four foundational concepts: document loaders, text splitters, embeddings, and vector databases.
The Four Building Blocks of RAG
flowchart LR
DL["📄 Document Loader\n(Load data from any source)"]
TS["✂️ Text Splitter\n(Break into chunks)"]
EM["🔢 Embedding Model\n(Text → Vectors)"]
VD["🗄️ Vector Database\n(Store + Search)"]
DL --> TS --> EM --> VD
style DL fill:#4a9eff,color:#fff
style TS fill:#f59e0b,color:#fff
style EM fill:#8b5cf6,color:#fff
style VD fill:#10b981,color:#fff
1. Document Loaders
A document loader is a LangChain abstraction for loading data from any source and converting it into a standardized Document object.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./mediumblog.txt")
documents = loader.load()
# documents[0].page_content → "The full text..."
# documents[0].metadata → {"source": "./mediumblog.txt"}
Why It Matters
Data comes in countless formats — text files, PDFs, Google Drive docs, Notion notebooks, WhatsApp exports, YouTube transcripts, Slack messages. The document loader provides a uniform interface: regardless of the source, you get a Document with page_content (the text) and metadata (source info, timestamps, etc.).
| Source | Loader | Interface |
|---|---|---|
| Text file | TextLoader | loader.load() → Document[] |
| PDF | PyPDFLoader | loader.load() → Document[] |
| Google Drive | GoogleDriveLoader | loader.load() → Document[] |
| YouTube | YoutubeLoader | loader.load() → Document[] |
| Notion | NotionDirectoryLoader | loader.load() → Document[] |
The interface is always .load() → list of Document objects. Switching data sources means changing one import and one class name — the rest of the pipeline stays the same.
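To make the "one interface, many sources" idea concrete, here is a minimal sketch in plain Python. It does not use LangChain: `Document`, `PlainTextLoader`, `JsonLinesLoader`, and `count_characters` are hypothetical stand-ins written for illustration, and the sample files are created in a temporary directory.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class Document:
    """Toy stand-in for a loader's output: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class Loader(Protocol):
    """Anything with a .load() that returns a list of Document objects."""
    def load(self) -> list[Document]: ...


class PlainTextLoader:
    """Hypothetical loader: one Document for the whole file."""
    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> list[Document]:
        text = Path(self.path).read_text(encoding="utf-8")
        return [Document(text, {"source": self.path})]


class JsonLinesLoader:
    """Hypothetical loader: one Document per JSON line that has a "text" field."""
    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> list[Document]:
        docs = []
        with open(self.path, encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                if line.strip():
                    record = json.loads(line)
                    docs.append(Document(record["text"], {"source": self.path, "line": line_no}))
        return docs


def count_characters(loader: Loader) -> int:
    """Downstream code depends only on .load(), not on the source format."""
    return sum(len(doc.page_content) for doc in loader.load())


with tempfile.TemporaryDirectory() as tmp:
    txt_path = str(Path(tmp) / "post.txt")
    jsonl_path = str(Path(tmp) / "messages.jsonl")
    Path(txt_path).write_text("hello RAG", encoding="utf-8")
    Path(jsonl_path).write_text('{"text": "hi"}\n{"text": "there"}\n', encoding="utf-8")

    text_docs = PlainTextLoader(txt_path).load()
    jsonl_docs = JsonLinesLoader(jsonl_path).load()
    text_chars = count_characters(PlainTextLoader(txt_path))
    jsonl_chars = count_characters(JsonLinesLoader(jsonl_path))
```

Swapping the source only changes which loader class is constructed; `count_characters` is untouched, which is the same property the LangChain loaders provide.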
2. Text Splitters
Once loaded, documents are typically too large for embedding and retrieval. Text splitters break them into smaller, semantically meaningful chunks.
# Recent LangChain releases ship splitters in the langchain-text-splitters package
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",   # Split on blank lines (paragraph boundaries)
    chunk_size=1000,    # Target size in characters; a longer single paragraph is kept whole
    chunk_overlap=0,    # No overlap between chunks
)
chunks = text_splitter.split_documents(documents)
# len(chunks) → e.g. 20 for a typical blog post
Chunk Size Trade-offs
Choosing the right chunk size is a critical design decision:
| Chunk Size | Advantage | Disadvantage |
|---|---|---|
| Too small (100 chars) | Very precise retrieval | Chunks lack context; the LLM can't understand them in isolation |
| Too large (10K chars) | Rich context per chunk | May include irrelevant content; wastes tokens; harder to search |
| Sweet spot (~500–1500 chars) | Readable, meaningful passages | Requires experimentation per document type |
Chunk Overlap
The chunk_overlap parameter controls how much text adjacent chunks share.
Overlap keeps information that straddles a chunk boundary from being lost, which helps when a sentence or idea spans the boundary. The cost is a higher total chunk count and more storage.
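The effect of overlap is easiest to see on a tiny string. The sketch below is a plain fixed-window splitter written for illustration; `split_with_overlap` is a hypothetical helper, not LangChain's algorithm (which also respects separators).

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrst"  # 20 characters
no_overlap = split_with_overlap(text, chunk_size=8, chunk_overlap=0)
with_overlap = split_with_overlap(text, chunk_size=8, chunk_overlap=3)
# no_overlap   → ['abcdefgh', 'ijklmnop', 'qrst']
# with_overlap → ['abcdefgh', 'fghijklm', 'klmnopqr', 'pqrst']
```

With an overlap of 3, the last three characters of each chunk reappear at the start of the next one, and the same text produces four chunks instead of three.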
[!TIP] Rule of thumb: Start with chunk_size=1000 and chunk_overlap=200. Adjust based on your document structure. For code, consider using RecursiveCharacterTextSplitter with language support.
3. Embeddings
Embeddings are what make similarity search possible. An embedding model converts text into a high-dimensional vector (a list of numbers) that captures the semantic meaning of the text.
flowchart LR
T1["'I want a large coffee'"]
T2["'I'll have a tall coffee'"]
T3["'Quiero pedir café grande'"]
EM["🔢 Embedding Model"]
T1 --> EM
T2 --> EM
T3 --> EM
EM --> V1["[0.12, 0.85, -0.34, ...]"]
EM --> V2["[0.11, 0.83, -0.31, ...]"]
EM --> V3["[0.13, 0.84, -0.33, ...]"]
style EM fill:#8b5cf6,color:#fff
The Key Property: Semantic Similarity → Vector Proximity
In a good embedding model, texts with similar meaning produce vectors that are close together in vector space, regardless of the exact words used and, for multilingual models, regardless of the language.
| Sentence | Language | Meaning | Vectors |
|---|---|---|---|
| "I want a large coffee" | English | Coffee order | [0.12, 0.85, -0.34, ...] |
| "I'll have a tall coffee" | English | Coffee order | [0.11, 0.83, -0.31, ...] → close |
| "Quiero pedir café grande" | Spanish | Coffee order | [0.13, 0.84, -0.33, ...] → close |
| "The stock market crashed" | English | Finance | [-0.67, 0.21, 0.93, ...] → far |
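As a rough check of "close" versus "far", the sketch below computes cosine similarity on the first three components shown in the table. The numbers are the table's illustrative values, truncated to three dimensions; real embeddings have hundreds or thousands of dimensions, and `cosine_similarity` is a hand-written helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative 3-component prefixes taken from the table above
large_coffee = [0.12, 0.85, -0.34]   # "I want a large coffee"
tall_coffee = [0.11, 0.83, -0.31]    # "I'll have a tall coffee"
cafe_grande = [0.13, 0.84, -0.33]    # "Quiero pedir café grande"
stock_crash = [-0.67, 0.21, 0.93]    # "The stock market crashed"

sim_english = cosine_similarity(large_coffee, tall_coffee)
sim_spanish = cosine_similarity(large_coffee, cafe_grande)
sim_finance = cosine_similarity(large_coffee, stock_crash)
```

The two coffee comparisons score above 0.99, while the finance sentence scores below zero, matching the "close" and "far" labels in the table.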
How Embeddings Enable RAG
This is the crucial insight: if the user asks "How tall is the Burj Khalifa?" and one of our chunks contains the Wikipedia paragraph about the Burj Khalifa's height, then:
- The question's embedding vector will be close to the chunk's embedding vector
- A similarity search in the vector database will return that chunk
- We inject the chunk into the prompt → the LLM gives an accurate, grounded answer
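The last step, injecting retrieved chunks into the prompt, can be sketched as plain string assembly. `build_augmented_prompt` and the prompt wording are assumptions made for illustration, and the chunk texts are the elided placeholders from this lesson's diagrams.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Number each retrieved chunk and place it ahead of the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved_chunks = ["Burj Khalifa is 828m...", "Dubai tourism..."]
prompt = build_augmented_prompt("How tall is the Burj Khalifa?", retrieved_chunks)
```

Numbering the chunks is one possible design choice: it lets the model (or the application) point back to the specific passage an answer came from.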
flowchart TD
Q["❓ Query: 'How tall is the Burj Khalifa?'"]
QV["🔢 Query Vector: [0.45, 0.78, ...]"]
Q --> QV
QV --> SEARCH["🔍 Vector Similarity Search"]
C1["Chunk: 'Burj Khalifa is 828m...'<br/>Vector: [0.44, 0.79, ...]<br/>Distance: 0.02 ✅"]
C2["Chunk: 'Coffee recipes...'<br/>Vector: [-0.31, 0.12, ...]<br/>Distance: 0.89 ❌"]
C3["Chunk: 'Dubai tourism...'<br/>Vector: [0.38, 0.71, ...]<br/>Distance: 0.12 ⚠️"]
SEARCH --> C1
SEARCH --> C3
style C1 fill:#10b981,color:#fff
style C2 fill:#ef4444,color:#fff
Distance Metrics
Vector databases support different ways to measure "closeness":
| Metric | What It Measures | Best For |
|---|---|---|
| Cosine similarity | Angle between vectors (direction, not magnitude) | Text similarity — most common default |
| Euclidean distance | Straight-line distance in vector space | When magnitude matters |
| Dot product | Combination of magnitude and direction | Optimized retrieval in some databases |
[!NOTE] For text embeddings, cosine similarity is the standard choice. It captures "are these texts about the same thing?" regardless of length differences.
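The difference between the three metrics shows up when two vectors point the same way but differ in magnitude. The sketch below uses made-up three-dimensional vectors and hand-written helper functions.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.dist(a, b)  # straight-line distance (Python 3.8+)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot_product(a, b) / (math.hypot(*a) * math.hypot(*b))


query = [1.0, 2.0, 2.0]
same_direction = [2.0, 4.0, 4.0]  # the query scaled by 2

cos_same = cosine_similarity(query, same_direction)    # direction only
dist_same = euclidean_distance(query, same_direction)  # sensitive to magnitude
dot_self = dot_product(query, query)
dot_scaled = dot_product(query, same_direction)        # grows with magnitude
```

Cosine similarity treats the two vectors as identical (1.0), Euclidean distance reports them as 3.0 apart, and the dot product scores the longer vector higher than the query scores against itself (18.0 versus 9.0). When all vectors are normalized to unit length, the three metrics produce the same ranking.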
4. Vector Databases
A vector database is a specialized database designed to:
1. Store millions of embedding vectors with their metadata
2. Search for the closest vectors to a query vector, extremely fast
3. Scale to billions of vectors with sub-second query times
flowchart TD
subgraph INGEST["Ingestion (one-time)"]
C["Chunks"] --> E["Embed"] --> S["Store in Vector DB"]
end
subgraph QUERY["Query Time"]
QT["Query text"] --> QE["Embed query"]
QE --> SIM["Similarity search\n(Top K nearest vectors)"]
SIM --> RES["Return chunks + metadata"]
end
S -.-> SIM
style INGEST fill:#4a9eff,color:#fff
style QUERY fill:#10b981,color:#fff
What's Stored Per Vector
Each entry in a vector database contains:
{
  "id": "chunk_17",
  "values": [0.12, 0.85, -0.34, ...],  // The embedding vector (1536 dimensions)
  "metadata": {
    "text": "The Burj Khalifa, at 828 meters, is the tallest building...",
    "source": "./documents/architecture.txt"
  }
}
- values: the embedding vector (used for similarity search)
- metadata.text: the original chunk text (returned to the application)
- metadata.source: where the chunk came from (for citations and traceability)
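The store-then-search behavior can be imitated with a brute-force in-memory class. This is a teaching sketch, not how production vector databases work internally (they use approximate nearest-neighbor indexes to stay fast at scale). `Entry` and `InMemoryVectorStore` are hypothetical names, the two-dimensional vectors come from the similarity-search diagram earlier in this lesson, and every id and source path except `chunk_17` and its path is invented.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One stored record, mirroring the id / values / metadata layout."""
    id: str
    values: list[float]
    metadata: dict = field(default_factory=dict)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


class InMemoryVectorStore:
    """Hypothetical brute-force store: compares the query against every entry."""

    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}

    def upsert(self, entry: Entry) -> None:
        self._entries[entry.id] = entry  # the same id overwrites the old record

    def query(self, vector: list[float], top_k: int = 2) -> list[tuple[float, Entry]]:
        scored = [(cosine_similarity(vector, e.values), e) for e in self._entries.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        return scored[:top_k]


store = InMemoryVectorStore()
store.upsert(Entry("chunk_17", [0.44, 0.79], {
    "text": "The Burj Khalifa, at 828 meters, is the tallest building...",
    "source": "./documents/architecture.txt",
}))
store.upsert(Entry("chunk_03", [-0.31, 0.12], {  # invented id and path
    "text": "Coffee recipes...", "source": "./documents/recipes.txt",
}))
store.upsert(Entry("chunk_42", [0.38, 0.71], {  # invented id and path
    "text": "Dubai tourism...", "source": "./documents/travel.txt",
}))

results = store.query([0.45, 0.78], top_k=2)  # vector for the Burj Khalifa question
top_ids = [entry.id for _, entry in results]
top_text = results[0][1].metadata["text"]
```

The query returns the Burj Khalifa chunk first and the Dubai chunk second; the coffee chunk points in a very different direction and is left out of the top 2.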
Popular Vector Databases
| Database | Type | Free Tier | Notes |
|---|---|---|---|
| Pinecone | Managed cloud | ✅ Yes | Used in this course; simple setup |
| Chroma | Open source (local) | ✅ Free | Great for development; runs locally |
| FAISS | Open source (library) | ✅ Free | Facebook's library; fast but no server |
| Weaviate | Open source + managed | ✅ Yes | Full-featured with GraphQL |
| Qdrant | Open source + managed | ✅ Yes | Rust-based; high performance |
Putting It All Together: The Complete RAG Pipeline
flowchart TD
subgraph Pipeline["Complete RAG Pipeline"]
direction TB
DOC["📄 Document\n(mediumblog.txt)"]
DOC --> LOAD["1️⃣ Document Loader\n(TextLoader)"]
LOAD --> SPLIT["2️⃣ Text Splitter\n(CharacterTextSplitter)"]
SPLIT --> EMBED["3️⃣ Embedding Model\n(OpenAI text-embedding-3-small)"]
EMBED --> STORE["4️⃣ Vector Database\n(Pinecone)"]
Q["❓ User Query"] --> QEMB["5️⃣ Embed Query"]
QEMB --> SIM["6️⃣ Similarity Search\n(Top K chunks)"]
STORE -.-> SIM
SIM --> AUG["7️⃣ Augment Prompt\n(context + question)"]
Q --> AUG
AUG --> LLM["8️⃣ LLM Generation\n(GPT-3.5/4)"]
LLM --> ANS["✅ Grounded Answer"]
end
Steps 1–4 happen once (ingestion). Steps 5–8 happen per query (retrieval + generation).
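The split between one-time ingestion and per-query work can be sketched end to end with toy components. Everything here is an assumption for illustration: `embed` is a word-count vector rather than a semantic embedding model, the index is a plain list rather than Pinecone, `echo_llm` is a stub that stands in for a real LLM call, and the two sample chunks are made up.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. Not a semantic model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ingest(chunks: list[str]) -> list[tuple[Counter, str]]:
    """Steps 3-4: embed every chunk once and keep (vector, text) pairs."""
    return [(embed(chunk), chunk) for chunk in chunks]


def retrieve(index: list[tuple[Counter, str]], question: str, top_k: int = 1) -> list[str]:
    """Steps 5-6: embed the query and return the top_k most similar chunks."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]


def answer(index: list[tuple[Counter, str]], question: str, llm: Callable[[str], str]) -> str:
    """Steps 7-8: augment the prompt with retrieved context and call the LLM."""
    context = "\n".join(retrieve(index, question))
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def echo_llm(prompt: str) -> str:
    """Stub LLM: returns the first context line instead of generating text."""
    return prompt.splitlines()[1]


chunks = [
    "The Burj Khalifa is 828 meters tall.",
    "Espresso is brewed by forcing hot water through ground coffee.",
]
index = ingest(chunks)  # ingestion runs once

# Query time runs per question against the same index
first_answer = answer(index, "How tall is the Burj Khalifa?", echo_llm)
second_answer = answer(index, "How is espresso brewed?", echo_llm)
```

The index is built once and reused for both questions; only the query embedding, the search, and the LLM call repeat.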
Summary
| Concept | What It Does | LangChain Class |
|---|---|---|
| Document Loader | Loads data from any source into Document objects | TextLoader, PyPDFLoader, etc. |
| Text Splitter | Breaks documents into smaller, searchable chunks | CharacterTextSplitter |
| Embedding Model | Converts text into numerical vectors capturing meaning | OpenAIEmbeddings |
| Vector Database | Stores vectors and performs fast similarity search | PineconeVectorStore |
| Similarity Search | Finds the closest vectors to a query → retrieves relevant chunks | vectorstore.similarity_search() |