06.03 — Medium Analyzer: Boilerplate Setup¶
Overview¶
This lesson sets up the development environment for the Medium Analyzer project — a complete RAG pipeline that ingests a Medium blog article about vector databases, stores it in Pinecone, and enables question-answering over it. We cover repository setup, dependency installation, Pinecone index creation, and environment variable configuration.
Step 1: Clone the Repository¶
git clone <repository-url>
cd langchain-course
git checkout -b project/rag-gist <initial-commit-hash>
Step 2: Install Dependencies¶
Dependency Table¶
| Package | Purpose |
|---|---|
langchain |
Core framework — prompts, chains, LCEL |
langchain-community |
Community loaders (e.g., TextLoader) |
langchain-openai |
OpenAI integration — ChatOpenAI + OpenAIEmbeddings |
langchain-pinecone |
Pinecone vector store integration |
python-dotenv |
Load API keys from .env file |
black / isort |
Code formatting |
[!TIP] The project uses uv as the package manager (fast Rust-based alternative to pip/poetry).
uv lockgenerates a lock file frompyproject.toml, anduv syncinstalls everything into a virtual environment at.venv/.
Step 3: Configure the IDE¶
After uv sync creates the .venv/ directory, configure your IDE to use that interpreter:
- Inside the terminal (with venv active):
which python3→ copy the path - In VS Code / Cursor:
Ctrl+Shift+P→ "Python: Select Interpreter" → paste the path
This resolves import errors in the editor.
Step 4: Create a Pinecone Index¶
What Is Pinecone?¶
Pinecone is a managed cloud vector database. You don't install or maintain anything — Pinecone handles storage, indexing, and similarity search infrastructure. It has a free tier that's sufficient for development.
Creating the Index¶
- Go to pinecone.io → Log in
- Click Create Index
- Configure:
| Setting | Value | Why |
|---|---|---|
| Index name | medium-blogs-embeddings-index |
Descriptive; matches the .env variable |
| Dimensions | 1536 |
Matches text-embedding-3-small output at full dimensionality |
| Metric | cosine |
Standard for text similarity |
| Type | Dense |
Standard embedding vectors (not sparse keyword vectors) |
| Capacity | Serverless |
Scales automatically; free tier friendly |
| Cloud / Region | AWS us-east-1 (default) |
Choose the region closest to your application |
Vector Dimensionality¶
The dimension (1536) must match the output dimensionality of the embedding model:
| Embedding Model | Default Dimensions | Settable |
|---|---|---|
text-embedding-3-small |
512 (default), up to 1536 | ✅ Yes |
text-embedding-3-large |
256 (default), up to 3072 | ✅ Yes |
text-embedding-ada-002 |
1536 (fixed) | ❌ No |
[!IMPORTANT] Longer vectors hold more semantic information but cost more storage. The dimension of the index and the embedding model must match exactly — a mismatch causes an error on insertion.
Similarity Metrics¶
| Metric | What It Measures | When to Use |
|---|---|---|
| Cosine | Angle between vectors (direction) | Text similarity — default choice |
| Euclidean | Geometric distance | When vector magnitude is meaningful |
| Dot product | Magnitude × direction | Fast; good for normalized vectors |
Production Considerations¶
- Cloud provider: Pinecone supports AWS, GCP, Azure. Choose based on compliance/privacy needs.
- Region: Deploy the vector store in the same region as your RAG application to avoid cross-region latency (egress costs).
- Capacity mode: Serverless is fine for development. Dedicated pods give predictable latency for production.
Step 5: Configure Environment Variables¶
Create a .env file:
# OpenAI (LLM + Embeddings)
OPENAI_API_KEY=sk-your-key-here
# Pinecone (Vector Store)
PINECONE_API_KEY=pcsk-your-key-here
INDEX_NAME=medium-blogs-embeddings-index
# LangSmith (Tracing - recommended)
LANGSMITH_API_KEY=ls-your-key-here
LANGSMITH_PROJECT=rag-gist
LANGSMITH_TRACING=true
| Variable | Why It's Needed |
|---|---|
OPENAI_API_KEY |
LLM calls (GPT-3.5/4) + Embedding API calls |
PINECONE_API_KEY |
Authentication for vector store operations |
INDEX_NAME |
Which Pinecone index to read from / write to |
LANGSMITH_* |
Tracing — see every step of the RAG pipeline in the LangSmith UI |
[!WARNING] The
PINECONE_API_KEYvariable name is important —langchain-pineconeexpects this exact name when auto-detecting the API key from environment variables.
Step 6: Validate the Setup¶
# ingestion.py
import os
from dotenv import load_dotenv
load_dotenv()
if __name__ == "__main__":
print("Ingestion...")
print(os.environ["PINECONE_API_KEY"][:8] + "...") # Quick check
Run: python ingestion.py → should print the first 8 characters of your Pinecone key.
Project File Structure¶
langchain-course/
├── .env ← API keys (gitignored)
├── .gitignore ← Excludes .env, .venv
├── .python-version ← Python version constraint
├── pyproject.toml ← Dependencies
├── uv.lock ← Exact dependency versions
├── ingestion.py ← Ingestion pipeline (load → split → embed → store)
├── main.py ← Retrieval pipeline (query → search → augment → generate)
└── mediumblog.txt ← Source document (Medium article about vector databases)
Summary¶
| Step | What We Did | Key Decision |
|---|---|---|
| Clone repo | Set up the starter code | Branch project/rag-gist |
| Install deps | uv lock && uv sync |
Latest LangChain versions |
| Configure IDE | Point to .venv Python interpreter |
Resolves import errors |
| Create Pinecone index | 1536 dimensions, cosine metric, serverless | Matches text-embedding-3-small output |
| Environment variables | OpenAI, Pinecone, LangSmith keys | PINECONE_API_KEY must be exact name |
| Validate | Run ingestion.py → prints key prefix |
Confirms setup works |