Function calling and tool use: where agents actually fail
If you are building tool-using AI systems, the safe assumption is this: function calling lets the model ask for a tool in a structured way, and tool use lets your app route the request, execute it and return the result. Neither one guarantees that the model picked the right tool, used the right arguments, or stayed inside a safe state boundary.
That matters because a clean demo can still fail in production for boring reasons: stale state, the wrong tool, a retry loop that repeats a side effect, or an approval boundary that never existed in code.
In this article, function calling means a model emits a structured request for a named tool. Tool use means the wider orchestration pattern where the model requests tools, your application executes them, and the result goes back into the conversation. The format can be neat while the workflow is still wrong.
Trust stack
AI draft model: gpt-5.4-mini. AI review model: gpt-5.4. Checked against current provider documentation and BFCL leaderboard pages on 2026-05-22.
Quick answer
Use function calling when you need a model to hand off work to a named tool in a machine-readable way. Use the wider tool-use pattern when the model sits inside a loop in which it can choose tools, receive tool results, and continue reasoning.
Do not treat either one as a safety layer.
The main production failures are usually not “the model cannot output JSON”. They are:
- the model picked the wrong tool;
- the model passed a plausible but wrong argument;
- the tool acted on stale or incomplete state;
- the system retried a failing call until it created a second problem;
- the tool was allowed to make a side effect without a hard approval gate;
- the app treated a benchmark score, or a schema check on field shape, as if it were a business-rule check.
Editor’s Note: A tidy tool call can hide a messy decision. The shape is not the judgement.
Editor’s Note: Most “agent failures” are systems failures wearing a model costume.
Editor’s Note: The safest tool is often a narrower tool. Product teams dislike that until the incident report arrives.
What function calling actually solves
Function calling is useful because it creates a structured bridge between language and execution.
Instead of hoping the model describes an action in prose that your code can guess, you ask it to emit a named operation with arguments. That gives you a cleaner handoff point for validation, logging and routing.
What it solves well:
- emitting tool calls in a format that your app can parse;
- carrying arguments in a predictable shape;
- separating “model thinking” from “application execution”;
- making it easier to reject malformed requests before the tool runs.
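The handoff point described above can be sketched as a small parser that rejects malformed requests before any tool runs. This is a minimal sketch, not any provider's API: the `{"name": ..., "arguments": ...}` payload shape, the `TOOL_SPECS` registry and both tool names are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Hypothetical registry: tool name -> required argument names and types.
TOOL_SPECS = {
    "get_order_status": {"order_id": str},
    "refund_order": {"order_id": str, "amount_pence": int},
}


@dataclass
class ToolCall:
    name: str
    arguments: dict


class MalformedToolCall(ValueError):
    """Raised when a request cannot be parsed or does not match a tool spec."""


def parse_tool_call(raw: str) -> ToolCall:
    """Turn raw model output into a ToolCall, or raise MalformedToolCall."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedToolCall(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise MalformedToolCall("payload must be a JSON object")

    name = payload.get("name")
    args = payload.get("arguments")
    if not isinstance(name, str) or name not in TOOL_SPECS:
        raise MalformedToolCall(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise MalformedToolCall("arguments must be an object")

    spec = TOOL_SPECS[name]
    if set(args) != set(spec):
        raise MalformedToolCall(
            f"expected arguments {sorted(spec)}, got {sorted(args)}"
        )
    for key, expected_type in spec.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(args[key], bool) or not isinstance(args[key], expected_type):
            raise MalformedToolCall(
                f"argument {key!r} must be {expected_type.__name__}"
            )
    return ToolCall(name=name, arguments=args)
```

A call that passes this parser is only well-formed. Nothing here says whether `refund_order` was the right tool to ask for.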
What it does not solve:
- whether the chosen tool is the correct one;
- whether the arguments match the current state;
- whether the action is allowed right now;
- whether the tool result should be trusted without further checks;
- whether repeating the call is safe.
Current provider-doc snapshot
The checked docs point in the same direction, even if the wording differs by provider.
| Source | What the current docs say | Practical takeaway |
|---|---|---|
| OpenAI structured model outputs | Structured output and function calling live in the same docs family; the emphasis is on producing machine-readable structure. | Structure helps parsing, not judgement. |
| Anthropic tool use with Claude | The model emits tool-use blocks; your application executes the tool and sends the result back. | Execution still happens in your code, so policy and approval gates stay yours. |
| Azure OpenAI JSON mode | JSON mode guarantees valid JSON output, but not a specific schema. | Syntax-only control is not enough for tool safety. |
| Azure OpenAI structured outputs | Structured outputs follow a JSON Schema you provide and are stricter than JSON mode. | Use schema checks where downstream code depends on field shape. |
| BFCL V4 leaderboard | The benchmark evaluates function-calling accuracy on real-world data and is updated periodically. | Benchmarks are useful evidence, not a guarantee of safe production behaviour. |
The short version: schema discipline matters, but workflow safety still lives in the code around the model.
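The gap between valid JSON, a matching schema and a passing business rule can be shown with one function that reports the first level a payload fails. This is a hand-rolled sketch under stated assumptions: the refund payload shape, the field names and the 5,000 pence limit are invented for illustration and come from no provider's documentation.

```python
import json


def check_levels(raw: str, max_refund_pence: int = 5_000) -> str:
    """Return the first level at which the payload fails, or 'ok'."""
    # Level 1: syntax. Roughly what a JSON-only mode protects.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "fails syntax"

    # Level 2: shape. Roughly what schema-constrained output protects.
    shape_ok = (
        isinstance(data, dict)
        and set(data) == {"order_id", "amount_pence"}
        and isinstance(data["order_id"], str)
        and isinstance(data["amount_pence"], int)
        and not isinstance(data["amount_pence"], bool)
    )
    if not shape_ok:
        return "fails schema"

    # Level 3: business rule. Always your code (limit is hypothetical).
    if not 0 < data["amount_pence"] <= max_refund_pence:
        return "fails business rule"
    return "ok"
```

For example, `check_levels('{"order_id": "A1", "amount_pence": 999999}')` returns `"fails business rule"` even though the payload is valid JSON with exactly the expected fields.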
Where tool-using workflows break
The failure usually appears after the first apparently successful call.
Wrong tool, wrong args, wrong state
In production, this kind of failure usually takes one of these shapes:
- the model chooses the wrong tool because the prompt was ambiguous;
- it chooses the right tool but passes the wrong identifier, date, or scope;
- it calls the right tool against stale state because the world changed after retrieval;
- the downstream system accepts the call because the shape looked fine.
A tool call can be perfectly formatted and still be operationally wrong.
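One way to catch a well-formed but operationally wrong call is to compare its arguments with the current state just before execution. The sketch below assumes a hypothetical `Order` record with a `version` counter and an in-memory `ORDERS` store standing in for a real database. The version comparison is ordinary optimistic concurrency, named here as a swapped-in technique rather than something the sources prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: str
    owner_id: str
    status: str
    version: int  # incremented on every change to the record


# Hypothetical in-memory store standing in for a real database.
ORDERS = {"A1": Order("A1", owner_id="u1", status="paid", version=3)}


def check_against_state(args: dict, acting_user: str, seen_version: int) -> list[str]:
    """Return reasons to reject; an empty list means the call may proceed."""
    order = ORDERS.get(args["order_id"])
    if order is None:
        return ["order does not exist"]

    problems = []
    if order.owner_id != acting_user:
        problems.append("order belongs to a different user")
    if order.version != seen_version:
        # The record changed after the model retrieved it: stale state.
        problems.append("order changed since the model saw it")
    if order.status != "paid":
        problems.append(f"cannot refund an order in status {order.status!r}")
    return problems
```

Returning every reason, rather than stopping at the first, gives the rejection log enough detail for a human to see why the call was stopped.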
Failure-mode table
| Failure type | What looks fine | What still fails | Mitigation |
|---|---|---|---|
| Wrong tool choice | Structured call parses cleanly | The action does the wrong job | Narrow the tool set; add tool selection rules; require explicit intent cues. |
| Wrong arguments | Arguments fit the schema | The call targets the wrong record, date or user | Validate against current state; cross-check IDs and permissions. |
| Stale state | Retrieval looked plausible | The world changed after the model saw it | Refresh critical state before execution; shorten state-sensitive loops. |
| Retry loop | The code retries cleanly | The tool repeats a side effect or amplifies load | Cap retries; use idempotency keys; separate transient failures from permanent ones. |
| Side effect without approval | The model got the call right | The action still should not run yet | Add explicit approval gates and human confirmation for risky actions. |
| Benchmark over-trust | A leaderboard score looks strong | Real production edge cases still fail | Treat benchmarks as selection evidence, not deployment permission. |
Why retries and side effects get dangerous
Retries are useful when the failure is transient and the action is safe to repeat.
They are much less useful when the action changes state.
If a tool can send a message, submit a form, write a record, charge money, or change an account setting, then an innocent retry can become a duplicate side effect. A model can also produce the same wrong request twice if the surrounding loop keeps asking it to “try again” without changing the underlying problem.
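A common defence against duplicate side effects is an idempotency key: the tool stores the result of the first call under a key and returns that stored result when the same key arrives again. The sketch below is illustrative only. `PaymentTool`, `key_for` and the way the key is derived are assumptions, and a real system would keep the key store in durable storage rather than in memory.

```python
import hashlib
import json


class PaymentTool:
    """Hypothetical side-effecting tool that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charges = []    # every charge actually performed
        self._results = {}   # idempotency key -> stored result

    def charge(self, idempotency_key: str, account: str, amount_pence: int) -> dict:
        # A repeated key returns the first result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_pence))
        result = {"status": "charged", "charge_number": len(self.charges)}
        self._results[idempotency_key] = result
        return result


def key_for(conversation_id: str, tool_name: str, arguments: dict) -> str:
    """Derive a stable key; argument order does not affect the result."""
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    raw = f"{conversation_id}|{tool_name}|{canonical}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Deriving the key from the conversation and the arguments means that a model repeating the same request produces the same key. The trade-off is that a second identical charge the user really did intend within one conversation would also be deduplicated, so some systems issue the key per user-confirmed action instead.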
The safety rule is simple:
- retry parse failures and transient transport errors cautiously;
- do not blindly retry non-idempotent actions;
- make the retry limit explicit;
- log the rejection reason so humans can see why the system stopped.
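The rules above can be sketched as a small retry wrapper. The two exception classes are a hypothetical classification, since your tools will raise their own error types, and the choice to give non-idempotent actions exactly one attempt is a deliberately conservative assumption.

```python
import time


class TransientError(Exception):
    """For example a timeout or dropped connection (hypothetical class)."""


class PermanentError(Exception):
    """For example a validation or permission failure; retrying will not help."""


def run_with_retries(action, *, idempotent: bool, max_attempts: int = 3,
                     base_delay: float = 0.5, log=print):
    """Run action(), retrying transient failures only when it is safe to repeat."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Non-idempotent actions get exactly one attempt.
    attempts_allowed = max_attempts if idempotent else 1

    for attempt in range(1, attempts_allowed + 1):
        try:
            return action()
        except PermanentError as exc:
            log(f"stopped: permanent failure: {exc}")
            raise
        except TransientError as exc:
            if attempt == attempts_allowed:
                log(f"stopped after {attempt} attempt(s): {exc}")
                raise
            log(f"attempt {attempt} failed transiently, retrying: {exc}")
            # Exponential backoff between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Passing `log` as a parameter keeps the stop reason visible to whoever reviews the run, which matches the last rule in the list.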
What to guard with approval checks and sandboxing
The right control is not “make the model smarter”. The right control is “reduce the blast radius before the tool runs”.
Use approval checks when the action affects money, access, user-visible state, or irreversible history.
Use sandboxing when you are still learning whether the tool path behaves the way you expect.
A safer production path looks like this:
- Ask for a structured tool call.
- Parse the request.
- Validate the arguments against schema and business rules.
- Re-check current state.
- Apply permission and approval gates.
- Execute only if the action is still safe.
- Record the result and the reason for any rejection.
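The path above can be sketched as an ordered list of gates that a call must pass before it executes. Everything named here is hypothetical: the `Gate` and `Decision` types, the `APPROVED` set standing in for a human-approval store, and the example rules inside each gate. Real checks would query live systems rather than inline lambdas.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    executed: bool
    reason: str
    result: object = None


@dataclass
class Gate:
    name: str
    check: Callable[[dict], bool]  # returns True when the call may proceed


def run_tool_call(call: dict, gates: list, execute: Callable[[dict], object],
                  audit_log: list) -> Decision:
    """Run gates in order; execute only if every gate passes; log the outcome."""
    for gate in gates:
        if not gate.check(call):
            decision = Decision(False, f"rejected at {gate.name}")
            audit_log.append((call.get("name"), decision.reason))
            return decision
    decision = Decision(True, "all gates passed", execute(call))
    audit_log.append((call.get("name"), decision.reason))
    return decision


APPROVED = set()  # hypothetical store of human-approved call IDs

gates = [
    Gate("schema", lambda c: isinstance(c.get("arguments"), dict)
         and "order_id" in c["arguments"]),
    Gate("business rules", lambda c: c["arguments"].get("amount_pence", 0) <= 5_000),
    Gate("current state", lambda c: c["arguments"]["order_id"] in {"A1"}),
    Gate("approval", lambda c: c.get("call_id") in APPROVED),
]
```

Because the gates run in a fixed order and the first failure stops the call, the audit log always records which check blocked execution.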
Editor’s Note: The production bug is often not the tool call. It is the missing check that should have stopped it.
A small rollout checklist
Before you let an LLM touch live tools, check these items:
- list every tool and classify it by risk;
- decide which tools are read-only and which create side effects;
- mark the actions that need human approval;
- define schema validation and business-rule validation separately;
- cap retries and make the retry policy visible;
- choose an idempotency strategy for repeatable actions;
- test with stale data, bad IDs and ambiguous prompts;
- run the tool path in a sandbox before enabling live writes;
- log every rejected call and every approval gate;
- keep the tool set narrower than the product team usually wants.
If the checklist feels strict, that is usually a good sign.
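Part of the checklist can live in code as a reviewable policy table. The sketch below is one possible shape. The risk levels, the tool names and the two lint rules are assumptions for illustration, and the rule against retrying side-effecting tools is intentionally stricter than a system with solid idempotency keys would need.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE_WRITE = "reversible_write"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class ToolPolicy:
    risk: Risk
    needs_human_approval: bool
    max_attempts: int


# Hypothetical tool set, kept deliberately narrow.
TOOL_POLICIES = {
    "get_order_status": ToolPolicy(Risk.READ_ONLY, needs_human_approval=False, max_attempts=3),
    "update_address": ToolPolicy(Risk.REVERSIBLE_WRITE, needs_human_approval=False, max_attempts=1),
    "refund_order": ToolPolicy(Risk.IRREVERSIBLE, needs_human_approval=True, max_attempts=1),
}


def policy_problems(policies: dict) -> list:
    """Return human-readable problems found in a policy table."""
    problems = []
    for name, policy in sorted(policies.items()):
        if policy.risk is Risk.IRREVERSIBLE and not policy.needs_human_approval:
            problems.append(f"{name}: irreversible tool without approval")
        if policy.risk is not Risk.READ_ONLY and policy.max_attempts > 1:
            problems.append(f"{name}: side-effecting tool allows blind retries")
    return problems
```

Running `policy_problems(TOOL_POLICIES)` in a test suite turns "mark the actions that need human approval" from a review comment into a failing build.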
The underlying risk is the same in every market: if an LLM can affect money, access, state, or a user-visible decision, format alone is not safety.
Methodology and sources
Check date: 2026-05-22
What was checked:
- OpenAI structured model outputs documentation.
- Anthropic tool use documentation.
- Azure OpenAI JSON mode documentation.
- Azure OpenAI structured outputs documentation.
- BFCL V4 leaderboard page.
- Selected live topic pages used as a framing check on the target topic.
Data points pulled from those pages:
- OpenAI’s docs family for structured outputs and function calling emphasizes machine-readable structure.
- Anthropic’s tool-use docs describe the tool-use block / application-executes / tool-result pattern.
- Azure’s JSON mode docs state that valid JSON is not the same as a guaranteed schema.
- Azure’s structured outputs docs state that the model follows a JSON Schema supplied in the call.
- BFCL states that it evaluates function-calling accuracy on real-world data and updates periodically.
Assumptions and limits:
- This guide is a practical synthesis, not a full market study.
- The failure-mode table is illustrative; it is not a claim about any single vendor’s incident history.
- No production incident claims are made here.
- No formula is needed for this article because the useful control is the validation flow, not arithmetic.
Change log
- 2026-05-22: first draft built from the llm-editor-approved brief with current provider-doc checks, a failure-mode table, a rollout checklist, glossary coverage and explicit safety caveats.
Source list
- OpenAI structured model outputs: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic tool use with Claude: https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview
- Azure OpenAI JSON mode: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/json-mode
- Azure OpenAI structured outputs: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/structured-outputs
- Berkeley Function Calling Leaderboard V4: https://gorilla.cs.berkeley.edu/leaderboard.html