# 11.05 — LangSmith Tracing & Execution

## Overview
This is the payoff lesson — we invoke the compiled graph with a real input, watch it execute through multiple generate → reflect cycles, and use LangSmith to trace and analyze every step of the execution.
## Invoking the Graph
```python
if __name__ == "__main__":
    inputs = {
        "messages": [
            HumanMessage(
                content="Make this tweet better: "
                "LangChain just announced a huge update: tool calling! "
                "This gives us a single interface for function calling "
                "across all supported LLMs — OpenAI, Gemini, Anthropic. "
                "No more vendor-specific code. This is big for AI devs!"
            )
        ]
    }
    response = graph.invoke(inputs)
```
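Once invoke returns, the final state holds the entire message history, and the last entry is the final tweet. A minimal, dependency-free sketch of extracting it (the dict with a "messages" list is a stand-in mirroring the shape of the inputs above; with real LangChain messages you would read `.content` on the last entry):

```python
# Stand-in for the state returned by graph.invoke — the "messages" list
# accumulates the input, drafts, and critiques in order.
response = {
    "messages": ["input", "draft v1", "critique 1", "draft v2", "critique 2", "draft v3"]
}

# The final tweet is simply the last message in the history.
# With real LangChain objects: response["messages"][-1].content
final_tweet = response["messages"][-1]
print(final_tweet)
```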
## What Happens When We Call graph.invoke(inputs)
The moment you call invoke, LangGraph executes the entire graph from start to finish:
```mermaid
sequenceDiagram
    participant App as main.py
    participant G as Graph Engine
    participant Gen as Generation Node
    participant Ref as Reflection Node
    participant LLM as GPT-3.5 (OpenAI)
    App->>G: invoke(inputs)
    Note over G: START → generate
    G->>Gen: state with 1 message
    Gen->>LLM: API call #1 (generate tweet v1)
    LLM-->>Gen: AIMessage (tweet v1)
    Note over G: should_continue(2 msgs) → reflect
    G->>Ref: state with 2 messages
    Ref->>LLM: API call #2 (critique tweet v1)
    LLM-->>Ref: Critique → cast to HumanMessage
    Note over G: reflect → generate
    G->>Gen: state with 3 messages
    Gen->>LLM: API call #3 (revise → tweet v2)
    LLM-->>Gen: AIMessage (tweet v2)
    Note over G: should_continue(4 msgs) → reflect
    G->>Ref: state with 4 messages
    Ref->>LLM: API call #4 (critique tweet v2)
    LLM-->>Ref: Critique → cast to HumanMessage
    Note over G: reflect → generate
    G->>Gen: state with 5 messages
    Gen->>LLM: API call #5 (revise → tweet v3)
    LLM-->>Gen: AIMessage (tweet v3 — final)
    Note over G: should_continue(6 msgs) → END
    G-->>App: Final state with 6 messages
```
Total LLM API calls: 5 (3 generation + 2 reflection). This is why the execution takes ~20 seconds — each API call takes 3–4 seconds.
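The routing in the diagram hinges purely on message count. A minimal, dependency-free sketch of such a should_continue (END and "reflect" here are stand-ins for LangGraph's END sentinel and the reflect node name from the earlier lessons):

```python
END = "__end__"      # stand-in for LangGraph's END sentinel
REFLECT = "reflect"  # node name assumed from the earlier lessons

def should_continue(messages: list) -> str:
    """Stop once the history holds 6 messages (input + 3 drafts + 2 critiques)."""
    if len(messages) >= 6:
        return END
    return REFLECT
```

With 2 or 4 messages in state it routes back to reflect; at 6 it routes to END, matching the trace above.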
## What LangSmith Shows
When LANGCHAIN_TRACING_V2=true is set, every step of the graph execution is automatically logged to LangSmith. Let's explore what the trace reveals.
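Tracing is controlled entirely through environment variables, so no code changes are needed in the graph itself. A minimal sketch of the setup (set these before running the graph; the API key value is a placeholder, and LANGCHAIN_PROJECT names the dashboard project the runs appear under):

```python
import os

# Turn on LangSmith tracing and point runs at the "reflection-agent" project.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."  # placeholder — use your LangSmith key
os.environ["LANGCHAIN_PROJECT"] = "reflection-agent"
```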
### The Trace View
In the LangSmith dashboard, under the "reflection-agent" project, you'll see a trace that looks like:
```
📊 Trace: graph.invoke (20.3s)
├── 🔵 generate (3.2s)
│   └── 🤖 ChatOpenAI (3.1s)
├── 🟡 should_continue
├── 🔵 reflect (3.8s)
│   └── 🤖 ChatOpenAI (3.7s)
├── 🔵 generate (4.1s)
│   └── 🤖 ChatOpenAI (4.0s)
├── 🟡 should_continue
├── 🔵 reflect (3.5s)
│   └── 🤖 ChatOpenAI (3.4s)
├── 🔵 generate (3.9s)
│   └── 🤖 ChatOpenAI (3.8s)
└── 🟡 should_continue → END
```
LangSmith traces every component:

- Each node execution (generate, reflect) with timing
- Each LLM call (ChatOpenAI) with input/output
- Each routing decision (should_continue) with the result
- The total execution time for the entire graph
### What to Look For in the Trace
1. The Growing Prompt:
The most illuminating part of the trace is watching the prompt grow with each iteration. Click on the last ChatOpenAI call to see the full prompt the LLM received:
| Message # | Type | Content |
|---|---|---|
| 1 | System | "You are a Twitter techie influencer assistant..." |
| 2 | Human | "Make this tweet better: LangChain just announced..." |
| 3 | AI | Tweet v1 (first draft) |
| 4 | Human | Critique: "Good start, but needs: 1) shorter length, 2) stronger hook..." |
| 5 | AI | Tweet v2 (revised based on critique) |
| 6 | Human | Critique: "Much better! But could still improve: 1) add emojis, 2) clearer CTA..." |
By the last call, the LLM has the complete history of the evolution — every draft, every critique. This context allows it to produce a significantly better final version.
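This growth happens because the graph's state reducer appends new messages to the history rather than replacing it. A toy illustration of that accumulation (plain strings stand in for LangChain message objects):

```python
# Each node returns only its new message(s); the reducer concatenates them
# onto the existing history, so every LLM call sees a longer prompt.
history: list = []
for step_output in (
    ["system prompt", "user tweet"],  # initial input
    ["tweet v1"],                     # generate
    ["critique of v1"],               # reflect (cast to Human)
    ["tweet v2"],                     # generate
):
    history = history + step_output   # append-style reduction, never overwrite

print(len(history))
```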
2. The HumanMessage Casting in Action:
In the trace, you can clearly see that reflection outputs are tagged as Human messages, even though they were generated by the AI. This confirms the prompt engineering technique from lesson 04 is working as intended.
3. Per-Node Timing:
The trace shows how long each node takes. Since every node makes an LLM call, the time is dominated by API latency. If you see one call taking significantly longer, it's usually because the prompt was longer (more history to process).
## Analyzing the Quality Improvement
Let's examine what a typical execution produces:
### Original Tweet (User Input)
"LangChain just announced a huge update: tool calling! This gives us a single interface for function calling across all supported LLMs — OpenAI, Gemini, Anthropic. No more vendor-specific code. This is big for AI devs!"
### After Generation v1
"🚀 Game-changing alert! LangChain's new tool calling feature unifies function calling across ALL major LLMs — GPT, Gemini, Claude — with a SINGLE interface. No more vendor lock-in. The future of AI dev is here. #LangChain #AI #DevTools"
### After Reflection + Generation v2
"One API to call them all. ⚡ LangChain just shipped tool calling — one interface for function calling across GPT, Gemini, and Claude. Write your tool logic once, run it everywhere. This changes everything for AI engineers. Who's already tried it? 👇 #LangChain #AI"
### After Reflection + Generation v3 (Final)
"Write once, call anywhere. LangChain just dropped tool calling ⚡ One interface. Every major LLM. Zero vendor lock-in. The era of portable AI tools starts now → link #AI #LangChain #DevTools"
What improved across iterations:

1. Length — got progressively shorter and punchier
2. Hook — moved from generic ("Game-changing alert!") to specific and intriguing ("Write once, call anywhere.")
3. Clarity — the message became more focused on the key value proposition
4. Engagement — added a call to action, question, and relevant hashtags
5. Style — evolved from press-release style to authentic tech influencer voice
## LangSmith as a Debugging Tool
Beyond just viewing traces, LangSmith is essential for debugging reflection agents:
| Issue | What LangSmith Shows |
|---|---|
| Agent not improving | Check if the reflection critique is too vague — the LLM might not have specific enough feedback to work with |
| Agent going in circles | Check if revisions are just random variations rather than directed improvements |
| Execution too slow | Check per-node timing — maybe the prompt is too long or the model is overloaded |
| Unexpected routing | Check should_continue output — verify it's returning the expected node name |
| Quality plateau | See if later critiques are just nitpicking — might need fewer iterations |
> [!TIP]
> LangSmith's Playground feature lets you re-run any individual LLM call with modified inputs. This is perfect for experimenting with different system prompts or message histories without re-running the entire graph.
## Could We Do This Without LangGraph?
Yes — but it would be more work and less clean:
```python
# Without LangGraph (manual loop):
messages = [HumanMessage(content="Make this tweet better: ...")]
for i in range(3):
    response = generate_chain.invoke({"messages": messages})
    messages.append(response)
    if i < 2:  # Don't reflect on the last iteration
        critique = reflect_chain.invoke({"messages": messages})
        messages.append(HumanMessage(content=critique.content))
```
This works, but:
- No built-in visualization — you can't draw_mermaid() to see the flow
- No built-in tracing — LangSmith's graph-aware tracing shows node boundaries; a manual loop shows a flat sequence
- Harder to modify — adding conditions, new nodes, or parallel branches requires rewriting the loop
- No state management — you handle appending messages manually, which is error-prone in complex graphs
- No serialization — LangGraph can checkpoint state, enabling human-in-the-loop or error recovery
LangGraph makes the pattern declarative — you describe what should happen (nodes + edges), not how to implement the loop.
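To make "declarative" concrete, here is a toy, dependency-free runner in the same spirit: nodes and edges are plain data, and one generic loop walks them. All names here are hypothetical illustrations; LangGraph's real engine adds typed state channels, checkpointing, and tracing on top of this idea.

```python
# Nodes are plain functions over the message history.
def generate(msgs):
    return msgs + [f"draft v{len(msgs) // 2 + 1}"]

def reflect(msgs):
    return msgs + ["critique"]

def route(msgs):
    # Mirrors should_continue: stop once 6 messages have accumulated.
    return "end" if len(msgs) >= 6 else "reflect"

# The graph itself is just data: a node table and an edge table.
nodes = {"generate": generate, "reflect": reflect}
edges = {"generate": route, "reflect": "generate"}  # a callable is a conditional edge

state, current = ["input"], "generate"
while True:
    state = nodes[current](state)
    nxt = edges[current]
    if callable(nxt):
        nxt = nxt(state)
    if nxt == "end":
        break
    current = nxt

print(len(state))  # 6 messages: input + 3 drafts + 2 critiques
```

Changing the topology means editing the tables, not rewriting the loop — which is exactly the property LangGraph generalizes.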
## Summary
| What We Did | What We Learned |
|---|---|
| Invoked graph.invoke(inputs) | The graph executes all nodes and edges autonomously — 5 LLM calls in ~20 seconds |
| Viewed the LangSmith trace | Every node, routing decision, and LLM call is logged with inputs, outputs, and timing |
| Analyzed quality improvement | The tweet improved significantly across 3 iterations — shorter, punchier, more engaging |
| Compared to manual looping | LangGraph provides visualization, tracing, state management, and extensibility that raw Python loops don't |
> [!IMPORTANT]
> The Reflection Agent pattern — generate → reflect → revise — is one of the fundamental building blocks of agentic AI. The same pattern applies to code generation (generate → test → fix), essay writing (draft → review → revise), and decision-making (propose → evaluate → improve). The specific chains change, but the graph structure remains the same.