Draft: Function calling and tool use—where agents actually fail

Introduction

The gap between a polished agent demo and a production-ready system is often measured in the reliability of its tool use. While “function calling” makes it appear as though an LLM has acquired agency, the reality is that most failures occur not because the model lacks intelligence, but because the interface between text generation and structured execution is brittle. This article explores the primary failure modes of agentic tool use and the systems-design patterns required to mitigate them.

What Function Calling Actually Solves

Function calling provides a bridge between unstructured reasoning and structured action. It allows a model to move from “thinking” in natural language to “executing” through a JSON-compliant schema. When it works correctly, it enables deterministic execution of side effects, such as querying a database or sending an email, triggered by probabilistic reasoning.
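A minimal sketch of that bridge, assuming the model emits a JSON object with "name" and "arguments" keys. The get_weather tool, the TOOLS registry, and the payload shape are all hypothetical and are not taken from any vendor's API; the model output is simulated as a string.

```python
import json

# Hypothetical tool: a stub standing in for a real API call.
def get_weather(city: str) -> str:
    return f"Weather for {city}: sunny"

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)    # probabilistic text becomes structured data
    func = TOOLS[call["name"]]           # KeyError if the model names an unknown tool
    return func(**call["arguments"])     # deterministic execution

# Simulated model output; no real LLM is involved here.
model_output = '{"name": "get_weather", "arguments": {"city": "Lisbon"}}'
result = dispatch(model_output)
print(result)
```

Everything after json.loads is ordinary, testable code. The brittle part is the single string the model produces.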

Where Tool-Using Workflows Break

The reliability of tool use is threatened by several distinct failure modes that span the spectrum from model hallucination to environmental mismatch.

1. The “Bloated Toolbox” Problem (Context Overload)

As documented by Anthropic’s engineering teams, providing an agent with a massive, undifferentiated array of tools increases cognitive load. When the toolset is too large, the model struggles with retrieval-style reasoning, often selecting a tool that is semantically similar but functionally incorrect. Mitigation: Implement hierarchical or tiered toolsets. Use a “Router” or “Initializer” agent to select a specialized sub-agent with only the relevant tools for the task at hand.
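The tiered-toolset idea can be sketched roughly as follows. The domains, tool names, and keyword matching are illustrative assumptions; in a real system the routing decision would more likely be made by a model call or a classifier than by keywords.

```python
# Hypothetical tiered toolsets: each domain exposes only a few tools.
TOOLSETS = {
    "billing": ["get_invoice", "issue_refund"],
    "files": ["read_file", "write_file", "delete_file"],
}

# Keyword routing is a stand-in for a "Router" agent's decision.
ROUTING_KEYWORDS = {
    "billing": ("invoice", "refund", "charge"),
    "files": ("file", "folder", "directory"),
}

def route(task: str) -> list:
    """Return the narrow toolset for the first domain whose keywords match."""
    text = task.lower()
    for domain, keywords in ROUTING_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return TOOLSETS[domain]
    return []  # no match: expose no tools rather than every tool

print(route("Refund the last invoice"))
print(route("Delete the old file"))
```

Returning an empty list on no match is a deliberate choice in this sketch: falling back to the full toolbox would reintroduce the overload the router exists to prevent.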

2. Schema Mismatch and Argument Hallucination

Even with features like OpenAI’s “Strict Mode,” models can hallucinate parameter values that conform to the schema but are invalid in the real world (e.g., passing a well-formed string ID that doesn’t exist in the database). Mitigation: Implement a robust validation layer between the model and the tool. Use Pydantic or Zod to validate types, and, crucially, intercept execution errors and feed them back to the model as a tool_result error message.
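A minimal sketch of such a validation layer, using only the standard library so it runs without dependencies; Pydantic or Zod would normally replace the hand-written type check. The get_order tool, the KNOWN_ORDER_IDS set, and the shape of the returned dictionary are assumptions made for illustration.

```python
# Stand-in for a database lookup (hypothetical data).
KNOWN_ORDER_IDS = {"ord_1001", "ord_1002"}

def validate_args(args: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    order_id = args.get("order_id")
    if not isinstance(order_id, str):
        errors.append("order_id must be a string")   # schema-level check
    elif order_id not in KNOWN_ORDER_IDS:
        errors.append(f"order_id {order_id!r} does not exist")  # real-world check
    return errors

def run_get_order(args: dict) -> dict:
    """Validate first, then execute; errors go back to the model as text."""
    errors = validate_args(args)
    if errors:
        # Hypothetical tool_result shape, not a specific vendor format.
        return {"type": "tool_result", "is_error": True, "content": "; ".join(errors)}
    return {"type": "tool_result", "is_error": False,
            "content": f"Order {args['order_id']} found"}

print(run_get_order({"order_id": "ord_9999"}))
print(run_get_order({"order_id": "ord_1001"}))
```

The second check is the one schema enforcement cannot provide: "ord_9999" is a perfectly valid string, and only the lookup reveals that it refers to nothing.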

3. The Stale State Trap

An agent may successfully call delete_file(path='config.json') but then immediately attempt to read_file(path='config.json'), unaware that its previous action permanently altered the environment. This “amnesia” occurs because the model’s internal state (its context window) is not automatically synchronized with the external world’s state. Mitigation: Ensure every tool response contains a high-signal summary of how the environment changed, or include periodic “environment snapshots” in the system prompt.
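One way to make a tool response carry its own state change is sketched below. The "ok" and "state_change" field names are hypothetical; the point is only that the result tells the model what is now true of the environment.

```python
import os
import tempfile

def delete_file(path: str) -> dict:
    """Delete a file and report how the environment changed."""
    existed = os.path.exists(path)
    if existed:
        os.remove(path)
    return {
        "ok": existed,
        # High-signal summary the model can read on its next turn.
        "state_change": (
            f"{path} was deleted and can no longer be read"
            if existed
            else f"{path} did not exist; nothing changed"
        ),
    }

# Demonstration against a throwaway temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
first = delete_file(path)
second = delete_file(path)
print(first["state_change"])
print(second["state_change"])
```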

4. The Infinite Retry Loop and Side-Effect Risk

A failure to handle errors properly can lead to a catastrophic loop: a tool fails → an error is returned → the model blindly retries the same failing command. If the tool involves a non-idempotent action (like charge_credit_card), this loop becomes a financial risk. Mitigation: Strictly enforce retry budgets and implement idempotency keys for all high-risk tools.
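Both mitigations can be sketched in one small wrapper. GuardedExecutor, its method names, and the in-memory dictionaries are hypothetical; a production system would persist idempotency keys in durable storage and would usually pass them through to the payment provider as well.

```python
import json
from typing import Optional

class RetryBudgetExceeded(Exception):
    """Raised when the same call has been attempted too many times."""

class GuardedExecutor:
    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self._attempts = {}   # call signature -> attempts so far
        self._completed = {}  # idempotency key -> cached result

    def call(self, tool, args: dict, idempotency_key: Optional[str] = None):
        # A completed key returns the cached result; the side effect is not repeated.
        if idempotency_key is not None and idempotency_key in self._completed:
            return self._completed[idempotency_key]
        # Identical tool + arguments share one budget, so blind retries are counted.
        signature = (tool.__name__, json.dumps(args, sort_keys=True))
        attempts = self._attempts.get(signature, 0) + 1
        if attempts > self.max_attempts:
            raise RetryBudgetExceeded(f"{tool.__name__} exceeded {self.max_attempts} attempts")
        self._attempts[signature] = attempts  # counted before running, so failures count
        result = tool(**args)
        if idempotency_key is not None:
            self._completed[idempotency_key] = result
        return result

charges = []  # records real side effects for the demonstration

def charge_credit_card(amount_cents: int) -> str:
    charges.append(amount_cents)
    return "charged"

executor = GuardedExecutor(max_attempts=2)
executor.call(charge_credit_card, {"amount_cents": 500}, idempotency_key="order-42")
executor.call(charge_credit_card, {"amount_cents": 500}, idempotency_key="order-42")
print(charges)  # the card was charged once despite two calls
```

When the budget runs out, the exception should surface to the orchestrator rather than to the model, so the loop is broken by code instead of by the model's judgment.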

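Failure-Mode Mapping: Type to Mitigation

Bloated toolbox (context overload): Tiered toolsets, with a router agent handing off to a sub-agent that holds only the relevant tools.
Schema mismatch and argument hallucination: A validation layer in front of every tool, with execution errors fed back to the model as tool_result errors.
Stale state: Tool responses that summarize how the environment changed, or periodic environment snapshots in the system prompt.
Infinite retry loop and side-effect risk: Enforced retry budgets and idempotency keys on all high-risk tools.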

[Editor’s Note: The Demo Trap]

The “magic” of an agent demo usually stems from the fact that the environment is perfectly prepared. In production, the latency of tools, the frequency of transient network errors, and the complexity of schema enforcement are what actually consume your engineering budget.

[Editor’s Note: Reliability via Narrowing]

The most successful ‘agentic’ systems often look less like autonomous explorers and more like highly controlled orchestrators. The safest tool is often a narrower, more specialized tool than the product team originally envisioned.

A Small Rollout Checklist for Safe Tool Use

Validator Layer: Does every tool call pass through a schema validator before execution?
Error Feedback: Are execution errors (404s, 500s, validation failures) returned to the model as text?
Retry Budget: Is there a hard cap on how many times an agent can call the same tool in a single session?
Idempotency Check: Do all side-effect-heavy tools support idempotency keys?
Human-in-the-loop (HITL): Are high-risk actions (deletions, payments, emails) gated by an approval step?
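
The approval gate in the last item could look roughly like this. The HIGH_RISK_TOOLS set, the approve callback, and the returned dictionary shape are assumptions for illustration; in practice the callback would block on a review queue or UI rather than return immediately.

```python
# Hypothetical list of tools that must never run without human sign-off.
HIGH_RISK_TOOLS = {"delete_file", "charge_credit_card", "send_email"}

def execute_with_approval(name: str, args: dict, tool, approve) -> dict:
    """Run low-risk tools directly; ask a human before running high-risk ones."""
    if name in HIGH_RISK_TOOLS and not approve(name, args):
        # The rejection is returned as data so the model can see what happened.
        return {"status": "rejected",
                "detail": f"{name} was not approved by a human reviewer"}
    return {"status": "ok", "result": tool(**args)}

sent = []  # records real side effects for the demonstration

def send_email(to: str) -> str:
    sent.append(to)
    return f"email sent to {to}"

# A reviewer that rejects everything, standing in for a real approval UI.
def deny_all(name, args):
    return False

outcome = execute_with_approval("send_email", {"to": "a@example.com"}, send_email, deny_all)
print(outcome["status"], sent)
```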

Methodology and Sources

Research conducted on June 21, 2026. Primary sources include OpenAI’s Structured Outputs documentation and Anthropic’s engineering insights regarding tool use reliability and context engineering. Findings synthesized from developer discussions (Reddit/StackOverflow) regarding schema enforcement limitations in production environments.

Change log

v1.0 (2026-06-21): Initial draft completed based on editorial brief.