theLLMs

Last checked: 2026-05-22

Scope: Global audience. This guide is not UK-only. Provider docs and benchmark pages were checked on 2026-05-22, but live tooling labels and product wording can move quickly.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

Function calling and tool use: where agents actually fail

If you are building tool-using AI systems, the safe answer is: function calling helps the model ask for a tool in a structured way; tool use helps your app route, execute and return the result; neither one guarantees the model picked the right tool, used the right arguments, or stayed inside a safe state boundary.

That matters because a clean demo can still fail in production for boring reasons: stale state, the wrong tool, a retry loop that repeats a side effect, or an approval boundary that never existed in code.

In this article, function calling means a model emits a structured request for a named tool. Tool use means the wider orchestration pattern where the model requests tools, your application executes them, and the result goes back into the conversation. The format can be neat while the workflow is still wrong.

Trust stack

AI draft model: gpt-5.4-mini. AI review model: gpt-5.4. Checked against current provider documentation and BFCL leaderboard pages on 2026-05-22.

Quick answer

Use function calling when you need a model to hand off work to a named tool in a machine-readable way. Use tool use when the model is part of a larger loop that can choose tools, receive tool results, and continue reasoning.

Do not treat either one as a safety layer.

The main production failures are usually not “the model cannot output JSON”. They are:

  • the model picked the wrong tool;
  • the model passed a plausible but wrong argument;
  • the tool acted on stale or incomplete state;
  • the system retried a failing call until it created a second problem;
  • the tool was allowed to make a side effect without a hard approval gate;
  • the app trusted a benchmark score or a schema check on field shape as if it were a business-rule check.

Editor’s Note: A tidy tool call can hide a messy decision. The shape is not the judgement.

Editor’s Note: Most “agent failures” are systems failures wearing a model costume.

Editor’s Note: The safest tool is often a narrower tool. Product teams dislike that until the incident report arrives.

What function calling actually solves

Function calling is useful because it creates a structured bridge between language and execution.

Instead of hoping the model describes an action in prose that your code can guess, you ask it to emit a named operation with arguments. That gives you a cleaner handoff point for validation, logging and routing.

What it solves well:

  • picking a tool call format that your app can parse;
  • carrying arguments in a predictable shape;
  • separating “model thinking” from “application execution”;
  • making it easier to reject malformed requests before the tool runs.

What it does not solve:

  • whether the chosen tool is the correct one;
  • whether the arguments match the current state;
  • whether the action is allowed right now;
  • whether the tool result should be trusted without further checks;
  • whether repeating the call is safe.

Current provider-doc snapshot

The checked docs point in the same direction, even if the wording differs by provider.

SourceWhat the current docs sayPractical takeaway
OpenAI structured model outputsStructured output and function calling live in the same docs family; the emphasis is on producing machine-readable structure.Structure helps parsing, not judgement.
Anthropic tool use with ClaudeThe model emits tool-use blocks and your application executes the tool result path.Execution still happens in your code, so policy and approval gates stay yours.
Azure OpenAI JSON modeJSON mode guarantees valid JSON output, but not a specific schema.Syntax-only control is not enough for tool safety.
Azure OpenAI structured outputsStructured outputs follow a JSON Schema you provide and are stricter than JSON mode.Use schema checks where downstream code depends on field shape.
BFCL V4 leaderboardThe benchmark evaluates function-calling accuracy on real-world data and is updated periodically.Benchmarks are useful evidence, not a guarantee of safe production behaviour.

The short version: schema discipline matters, but workflow safety still lives in the code around the model.

Where tool-using workflows break

The failure usually appears after the first apparently successful call.

Wrong tool, wrong args, wrong state

This is the most common production shape of failure:

  • the model chooses the wrong tool because the prompt was ambiguous;
  • it chooses the right tool but passes the wrong identifier, date, or scope;
  • it calls the right tool against stale state because the world changed after retrieval;
  • the downstream system accepts the call because the shape looked fine.

A tool call can be perfectly formatted and still be operationally wrong.

Failure-mode table

Failure typeWhat looks fineWhat still failsMitigation
Wrong tool choiceStructured call parses cleanlyThe action does the wrong jobNarrow the tool set; add tool selection rules; require explicit intent cues.
Wrong argumentsArguments fit the schemaThe call targets the wrong record, date or userValidate against current state; cross-check IDs and permissions.
Stale stateRetrieval looked plausibleThe world changed after the model saw itRefresh critical state before execution; shorten state-sensitive loops.
Retry loopThe code retries cleanlyThe tool repeats a side effect or amplifies loadCap retries; use idempotency keys; separate transient failures from permanent ones.
Side effect without approvalThe model got the call rightThe action still should not run yetAdd explicit approval gates and human confirmation for risky actions.
Benchmark over-trustA leaderboard score looks strongReal production edge cases still failTreat benchmarks as selection evidence, not deployment permission.

Why retries and side effects get dangerous

Retries are useful when the failure is transient and the action is safe to repeat.

They are much less useful when the action changes state.

If a tool can send a message, submit a form, write a record, charge money, or change an account setting, then an innocent retry can become a duplicate side effect. A model can also produce the same wrong request twice if the surrounding loop keeps asking it to “try again” without changing the underlying problem.

The safety rule is simple:

  • retry parse failures and transient transport errors cautiously;
  • do not blindly retry non-idempotent actions;
  • make the retry limit explicit;
  • log the rejection reason so humans can see why the system stopped.

What to guard with approval checks and sandboxing

The right control is not “make the model smarter”. The right control is “reduce the blast radius before the tool runs”.

Use approval checks when the action affects money, access, user-visible state, or irreversible history.

Use sandboxing when you are still learning whether the tool path behaves the way you expect.

A safer production path looks like this:

  1. Ask for a structured tool call.
  2. Parse the request.
  3. Validate the arguments against schema and business rules.
  4. Re-check current state.
  5. Apply permission and approval gates.
  6. Execute only if the action is still safe.
  7. Record the result and the reason for any rejection.

Editor’s Note: The production bug is often not the tool call. It is the missing check that should have stopped it.

A small rollout checklist

Before you let an LLM touch live tools, check these items:

  • list every tool and classify it by risk;
  • decide which tools are read-only and which create side effects;
  • mark the actions that need human approval;
  • define schema validation and business-rule validation separately;
  • cap retries and make the retry policy visible;
  • choose an idempotency strategy for repeatable actions;
  • test with stale data, bad IDs and ambiguous prompts;
  • run the tool path in a sandbox before enabling live writes;
  • log every rejected call and every approval gate;
  • keep the tool set narrower than the product team usually wants.

If the checklist feels strict, that is usually a good sign.

Visual handling

There is no physical-device or hardware visual issue on this page.

If Dev/UX wants a visual, make it a small workflow diagram, not a hype illustration. A simple request → validate → approve → execute flow is enough.

Global applicability

This article is global. There is no GB / NI split to apply.

The underlying risk is the same in every market: if an LLM can affect money, access, state, or a user-visible decision, format alone is not safety.

Glossary pass

Key terms are defined in plain English on first use in the article body, including function calling, tool use, schema, side effect, sandbox, retry loop, and idempotency.

Methodology and sources

Check date: 2026-05-22

What was checked:

  • OpenAI structured model outputs documentation.
  • Anthropic tool use documentation.
  • Azure OpenAI JSON mode documentation.
  • Azure OpenAI structured outputs documentation.
  • BFCL V4 leaderboard page.
  • Selected live topic pages used as a framing check on the target topic.

Data points pulled from those pages:

  • OpenAI’s docs family for structured outputs and function calling emphasizes machine-readable structure.
  • Anthropic’s tool-use docs describe the tool-use block / application-executes / tool-result pattern.
  • Azure’s JSON mode docs state that valid JSON is not the same as a guaranteed schema.
  • Azure’s structured outputs docs state that the model follows a JSON Schema supplied in the call.
  • BFCL states that it evaluates function-calling accuracy on real-world data and updates periodically.

Assumptions and limits:

  • This guide is a practical synthesis, not a full market study.
  • The failure-mode table is illustrative; it is not a claim about any single vendor’s incident history.
  • No production incident claims are made here.
  • No formula is needed for this article because the useful control is the validation flow, not arithmetic.

Change log

  • 2026-05-22: first draft built from the llm-editor-approved brief with current provider-doc checks, a failure-mode table, a rollout checklist, glossary coverage and explicit safety caveats.

Source list