05.02 — Understanding Function Calling for LLMs¶

Overview¶

This lesson provides a deep theoretical understanding of how function calling works — the mechanics behind the JSON response, how models are fine-tuned to support it, the two distinct use cases (tool integration and structured output), and the advantages and tradeoffs compared to the ReAct approach.

What Is Function Calling?¶

Function calling (or tool calling) is an LLM capability where the model produces a structured function invocation — specifying which function to call and with what arguments — instead of generating plain text.

flowchart LR
    subgraph Without["Without Function Calling"]
        W1["User: What's the weather in Paris?"]
        W2["LLM: The weather in Paris\nis currently... (hallucinates)"]
        W1 --> W2
    end

    subgraph With["With Function Calling"]
        F1["User: What's the weather in Paris?"]
        F2["LLM: tool_call(\nname='get_weather',\nargs={'location': 'Paris'}\n)"]
        F3["App: execute get_weather('Paris')"]
        F4["App → LLM: Result: 18°C, sunny"]
        F5["LLM: The weather in Paris is 18°C."]
        F1 --> F2 --> F3 --> F4 --> F5
    end

    style Without fill:#ef4444,color:#fff
    style With fill:#10b981,color:#fff

The Key Distinction¶

Without function calling, the LLM can only generate text. If you ask about the weather, it either hallucates an answer from its training data or admits it doesn't know. With function calling, the LLM can say: "I need to call the get_weather function with the argument Paris" — and the application can execute that function and feed the real result back.

How Function Calling Works¶

Step 1: Bind Function Definitions¶

Before sending a request to the LLM, the application provides a list of function definitions — describing each available function's name, parameters, and purpose:

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city name, e.g., 'Paris'"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "Temperature unit"
            }
          },
          "required": ["location"]
        }
      }
    }
  ]
}

The LLM receives these definitions alongside the user's message. It now "knows" that get_current_weather exists and how to call it.

Step 2: LLM Decides Whether to Call a Function¶

The fine-tuned model analyzes the user's request and decides: - If the request can be answered directly → generates normal text - If a function is needed → produces a structured tool call

This decision is made internally by the model — you don't need prompt engineering to trigger it.

Step 3: Application Executes the Function¶

The application receives the structured tool call, parses the JSON, and executes the corresponding function:

# The LLM response contains:
tool_call = {
    "name": "get_current_weather",
    "arguments": {
        "location": "Paris",
        "unit": "celsius"
    }
}

# Application executes the actual function:
result = get_current_weather(
    location=tool_call["arguments"]["location"],
    unit=tool_call["arguments"]["unit"]
)
# result = {"temp": 18, "condition": "sunny"}

Step 4: Feed Result Back to LLM¶

The application sends the function result back to the LLM, which then generates the final response using real data:

sequenceDiagram
    participant U as 👤 User
    participant A as 🖥️ App
    participant L as 🤖 LLM
    participant F as 🔧 get_weather()

    U->>A: "What's the weather in Paris?"
    A->>L: User message + tool definitions
    L->>A: tool_call: get_weather(Paris, celsius)
    A->>F: Execute function
    F->>A: {temp: 18, condition: "sunny"}
    A->>L: Tool result: {temp: 18, condition: "sunny"}
    L->>A: "It's 18°C and sunny in Paris."
    A->>U: Display answer

[!NOTE] The LLM never executes the function itself. It only produces the intent to call a function. The application is responsible for actual execution. This is a critical security boundary — the LLM can't directly access databases, APIs, or file systems without the application's permission.

The Fine-Tuning Behind Function Calling¶

Function calling isn't a prompt trick — the LLM has been specifically fine-tuned for this capability:

flowchart TD
    BASE["Base LLM\n(Text generation only)"]
    FT["Fine-tuning on\nmillions of function call examples"]
    FC["Function-calling LLM"]

    BASE --> FT --> FC

    FC --> D1["Detects when a function\nis needed"]
    FC --> D2["Selects the correct function\nfrom available options"]
    FC --> D3["Extracts arguments from\nthe user's query"]
    FC --> D4["Formats response as\nvalid JSON schema"]

    style BASE fill:#ef4444,color:#fff
    style FC fill:#10b981,color:#fff

This fine-tuning teaches the model to: 1. Recognize function-appropriate queries — "What's the weather?" triggers a function call, "Tell me a joke" doesn't 2. Match queries to functions — if 5 tools are available, pick the right one 3. Extract arguments — pull "Paris" from "What's the weather in Paris?" 4. Adhere to schemas — produce valid JSON that matches the function's parameter definitions

[!IMPORTANT] Not all LLMs support function calling — it requires specific fine-tuning. However, all major state-of-the-art models (OpenAI GPT-4/3.5, Anthropic Claude, Google Gemini) support it as of 2024. OpenAI introduced function calling in June 2023, and other vendors quickly followed.

Two Use Cases of Function Calling¶

Function calling isn't just for calling external tools — it has two distinct applications:

Use Case 1: External Tool Integration¶

The primary use case: connecting the LLM to external systems.

flowchart LR
    LLM["🤖 LLM"] -->|"tool_call"| DB["🗄️ Database Query"]
    LLM -->|"tool_call"| API["🌐 REST API Call"]
    LLM -->|"tool_call"| SEARCH["🔍 Web Search"]
    LLM -->|"tool_call"| CALC["🧮 Calculator"]
    LLM -->|"tool_call"| FILE["📁 File System"]

    style LLM fill:#4a9eff,color:#fff

Examples: - Search the web for current information (Tavily, Google) - Query a database for user records - Call a weather API - Execute code in a sandbox - Send an email or create a calendar event

Use Case 2: Structured Output¶

A less obvious but equally powerful use case: forcing the LLM to produce output in a specific format.

Instead of hoping the LLM returns properly formatted data, you define a function schema that describes your desired output format, and the LLM is forced to fill in those fields:

# Define the schema (Pydantic model)
class MovieReview(BaseModel):
    title: str = Field(description="Movie title")
    rating: float = Field(description="Rating from 1-10")
    summary: str = Field(description="One-sentence summary")
    pros: list[str] = Field(description="List of positive aspects")
    cons: list[str] = Field(description="List of negative aspects")

# Bind as tool → LLM MUST fill all fields
llm.bind_tools([MovieReview], tool_choice="MovieReview")

The LLM is now forced to produce output that matches this schema. You get a typed, predictable response instead of free-form text.

[!TIP] This is exactly the technique used in the Reflexion Agent (Section 12) — the AnswerQuestion and ReviseAnswer Pydantic models are bound as tools to force the LLM to produce structured article + critique + search queries.

Use Case	Goal	Example
Tool Integration	Execute external functions	Call a weather API, search the web, query a database
Structured Output	Force response format	Extract data into typed fields (name, rating, summary)

Advantages of Function Calling¶

1. Structured and Reliable Integration¶

The model's output is machine-readable JSON with a specific structure:

{
  "tool_calls": [
    {
      "function": {
        "name": "get_current_weather",
        "arguments": "{\"location\": \"Paris\", \"unit\": \"celsius\"}"
      }
    }
  ]
}

Compared to ReAct's text-based output, this is: - Trivially parsable — json.loads() instead of regex - Schema-validated — the JSON must match the function definition - Unambiguous — no risk of the function name being mixed with commentary

2. Token Efficiency¶

Function calling saves tokens because the LLM doesn't need to produce the chain-of-thought reasoning that the ReAct prompt requires:

Approach	Token Usage
ReAct prompt	`Thought: I need to check the weather... Action: get_weather... Action Input: {"location": "Paris"}... Observation: ...` (~100+ tokens of reasoning)
Function calling	`tool_call: get_weather(location="Paris")` (~20 tokens)

The reasoning still happens, but it's internal to the model — only the final decision (which function, what arguments) is output. This means fewer output tokens, which translates to lower cost and faster responses.

3. Vendor-Managed Quality¶

With the ReAct prompt, you are responsible for reliability through prompt engineering. With function calling, the LLM vendor (OpenAI, Anthropic, Google) has fine-tuned the model and invested engineering effort into making it reliable. You benefit from their ongoing improvements without changing your code.

The Tradeoff: Opaque Reasoning¶

Function calling has one notable drawback: you can't see the LLM's reasoning.

flowchart LR
    subgraph ReAct["ReAct: Visible Reasoning"]
        RT["Thought: I need to check the weather\nbecause the user asked about Paris.\nI should use get_weather."]
        RA["Action: get_weather"]
    end

    subgraph FC["Function Calling: Opaque"]
        FB["🔒 Internal reasoning\n(hidden)"]
        FO["Output: tool_call(get_weather)"]
    end

    style ReAct fill:#4a9eff,color:#fff
    style FC fill:#8b5cf6,color:#fff

Aspect	ReAct	Function Calling
Reasoning visibility	Full chain-of-thought is output	Hidden — only the final decision is visible
Debugging	Easy — read the thought process	Harder — you see what it called but not why
Auditing	Full audit trail	Limited — tool call + arguments only

In practice, this tradeoff is worth it. The reliability improvement (from ~85% to >99%) far outweighs the loss of reasoning visibility. And for debugging, tools like LangSmith provide tracing that compensates for the opaque reasoning.

[!NOTE] Some models combine both approaches — they do chain-of-thought reasoning AND produce structured function calls. OpenAI's reasoning models and Anthropic's extended thinking provide visible reasoning before making tool calls.

Function Calling vs. ReAct — Full Comparison¶

Dimension	ReAct Prompt	Function Calling
Mechanism	Text generation + regex parsing	Fine-tuned JSON generation in dedicated field
Reliability	~85–95%	>99% with modern models
Parsing	Regex (brittle)	`json.loads()` (trivial)
Token cost	High (chain-of-thought output)	Low (only function call output)
Reasoning visibility	Full chain-of-thought visible	Opaque (internal reasoning)
Who's responsible	Developer (prompt + regex)	Model vendor (fine-tuning)
Structured output	Not easily achievable	Native use case
Multi-vendor support	Any LLM (prompt-based)	Requires fine-tuned model
Production readiness	Prototype/demo only	Production standard

Summary¶

Concept	Key Takeaway
Definition	LLM produces structured JSON to invoke external functions, with name and arguments
How it works	Bind function definitions → LLM decides if/which to call → App executes → Result fed back
Fine-tuning	Models are specifically trained to detect, select, and format function calls
Use Case 1	External tool integration (APIs, databases, search engines)
Use Case 2	Structured output (force response into Pydantic schemas)
Advantages	Reliable JSON output, token-efficient, vendor-managed quality
Tradeoff	Opaque reasoning — you see the decision but not the justification
Industry status	De facto standard since 2023; all production AI agents use function calling