Tool-use safety: stopping agents from taking dangerous actions

Tool use is where tidy demos become real risk. Once a model can write a record, send a message, change a setting or trigger a workflow, the safety problem is no longer just the text it emits. It is the side effect that follows.

The safe pattern is boring: validate the request, re-check current state, enforce permissions, add an approval gate for risky actions, and make retries idempotent. If that feels strict, good. The blast radius is smaller for a reason.

A structured tool call is not a safe action. It is only a better-shaped request. The code around it still has to decide whether the action is allowed right now.
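A minimal sketch of that surrounding code, assuming a hypothetical close_ticket tool and an in-memory TICKETS store (both invented for illustration). It shows the two checks a well-formed call still has to pass: argument validation and a re-check of current state at execution time.

```python
from dataclasses import dataclass

# Hypothetical in-memory state standing in for a real ticket system.
TICKETS = {"T-1": {"status": "open"}, "T-2": {"status": "closed"}}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def check_close_ticket(args: dict) -> Decision:
    """Decide whether a proposed close_ticket call may run right now."""
    ticket_id = args.get("ticket_id")
    # 1. Validate shape: well-formed arguments are necessary, not sufficient.
    if not isinstance(ticket_id, str) or not ticket_id:
        return Decision(False, "invalid arguments")
    # 2. Re-check current state at execution time, not at proposal time.
    ticket = TICKETS.get(ticket_id)
    if ticket is None:
        return Decision(False, "ticket not found")
    if ticket["status"] != "open":
        return Decision(False, "ticket is not open")
    return Decision(True, "ok")
```

Permission checks, approval and idempotency would sit behind the same decision function in a fuller version; they are left out here to keep the sketch short.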

What this means

A tool is a function the model can call with typed parameters. The model proposes a call (tool name + arguments), and the application decides whether to execute it. That decision layer — the gate between “the model wants to do this” and “we actually do it” — is where tool-use safety lives.

Designing this gate means answering four questions: which tool calls execute immediately, which need confirmation, which are blocked, and how the system recovers if a call fails midway. The answers differ for a “read weather” tool (low risk, execute immediately), a “send email” tool (medium risk, require confirmation), and a “delete database record” tool (high risk, require human approval and logging).
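One way to make those answers explicit is a policy table that maps each tool to a gate. The sketch below is illustrative only: the tool names and the Gate enum are assumptions, and the one design choice it demonstrates is that unknown tools are blocked by default.

```python
from enum import Enum


class Gate(Enum):
    EXECUTE = "execute"  # run immediately
    CONFIRM = "confirm"  # ask for confirmation first
    APPROVE = "approve"  # hold for human approval and log the outcome
    BLOCK = "block"      # never run


# Hypothetical tool names; the tiers mirror the three examples above.
POLICY = {
    "read_weather": Gate.EXECUTE,
    "send_email": Gate.CONFIRM,
    "delete_record": Gate.APPROVE,
}


def gate_for(tool_name: str) -> Gate:
    """Look up the gate for a tool; anything not listed is blocked."""
    return POLICY.get(tool_name, Gate.BLOCK)
```

Defaulting to BLOCK means a newly added tool does nothing until someone decides which tier it belongs in.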

Where teams misuse it

No approval gate between “model proposes” and “tool executes”. The model calls a tool and the application executes it without checking whether the action is appropriate for this user, this context, or this point in the workflow. A model that proposes “send email to all users” should not be executing that call without a human confirming the recipient list.
Idempotency as an afterthought. If a tool call fails and the model retries, does the retry create a duplicate? For a “create ticket” tool, a failed first call followed by a successful retry might create two tickets. The tool needs to be idempotent: repeat calls with the same input produce the same result as the first successful call.
Granting write access when read-only would suffice. A model that summarises support tickets is given a tool that can also delete or modify tickets. The developer thought “the model might need to update ticket status” but never tested what happens when the model interprets a user request as “close this ticket and mark as resolved” rather than “summarise this ticket.”
Failing to scope tool access per user or per session. A tool that checks customer account balance should not check any account — it should check the account the current user is authorised to see. Scoping happens at the application layer, not in the model’s tool description.
No audit trail for tool calls. When a tool executes a side effect, the only record is the model’s response. Nobody can later answer: what tool was called, with what arguments, by which session, and was it approved or automatic?
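The audit questions in the last item can be answered by writing one structured record per tool call. This is a sketch under stated assumptions: AUDIT_LOG is an in-memory stand-in for an append-only store, and the SENSITIVE_KEYS list is invented for illustration.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = []  # stand-in for an append-only store (file, table, log service)
SENSITIVE_KEYS = {"body", "password", "token"}  # assumed list; adjust per tool


def sanitise(args: dict) -> dict:
    """Replace sensitive argument values before they reach the log."""
    return {k: "[redacted]" if k in SENSITIVE_KEYS else v for k, v in args.items()}


def record_tool_call(tool, args, session_id, user_id, approval, result) -> dict:
    """Append one structured audit entry per executed tool call."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "arguments": sanitise(args),
        "session_id": session_id,
        "user_id": user_id,
        "approval": approval,  # "auto" or "reviewed"
        "result": result,
    }
    AUDIT_LOG.append(json.dumps(entry, sort_keys=True))
    return entry
```

Calling record_tool_call from the same code path that executes the tool, rather than from the model-facing layer, keeps the log tied to what actually ran.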

Real scenarios: approval gate patterns

Scenario A: “Send email” tool

A model has a send_email(to, subject, body) tool. Without an approval gate, a user could ask “send this to twenty customers” and the model would execute twenty API calls. The fix: define send_email as a review-required tool. The model proposes the call with all parameters. The application holds it in a pending state and presents it to a human reviewer (via a dashboard, Slack notification, or inline confirmation), who confirms or rejects before the email is dispatched. This is especially important for bulk, financial, or legal communications.
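The pending-state flow in Scenario A might look like the sketch below. Everything here is hypothetical: SENT stands in for a real email provider, PENDING for a durable queue, and the propose/review function names are invented for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

SENT = []     # stand-in for the real email provider
PENDING = {}  # call id -> PendingCall; a real system would persist this


@dataclass
class PendingCall:
    tool: str
    args: dict
    status: str = "pending"
    reviewed_by: Optional[str] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def propose(tool: str, args: dict) -> PendingCall:
    """Record the model's proposed call without executing anything."""
    call = PendingCall(tool, dict(args))
    PENDING[call.id] = call
    return call


def review(call_id: str, approved: bool, reviewer: str) -> PendingCall:
    """Apply a human decision; the side effect happens only on approval."""
    call = PENDING[call_id]
    if call.status != "pending":
        raise ValueError("call already reviewed")
    call.reviewed_by = reviewer
    if approved:
        call.status = "approved"
        SENT.append(call.args)  # the only place an email is dispatched
    else:
        call.status = "rejected"
    return call
```

The model only ever reaches propose; review is wired to the dashboard or confirmation UI, so no model output can dispatch an email by itself.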

Scenario B: “Create ticket” tool

A model has a create_ticket(summary, priority, assignee) tool. The first call fails with a timeout. The model retries. The second call succeeds. Now there may be two tickets, because a timeout does not prove that the first call failed on the server. The fix: make the tool idempotent by including a client-generated idempotency key (e.g. a hash of the conversation ID + call index) that stays the same across retries of the same logical call. The tool checks: “has this key been used before?” If yes, it returns the existing result instead of creating a duplicate. This is the same pattern payment APIs use to prevent double charges.

Scenario C: “Read customer data” tool

A model has a get_customer(account_id) tool. Without access scoping, a user could ask “what is customer 88741’s address?” and the model would retrieve it even if the user has no relationship to that account. The fix: the application intercepts the tool call and checks that account_id belongs to a customer the current session is authorised to view. If not, the tool returns an access-denied response that the model cannot override. Scoping is enforced by the application, not by a system-prompt instruction.
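The interception in Scenario C can be sketched as a wrapper that consults a session-to-account mapping before touching any data. The CUSTOMERS and SESSION_SCOPE tables, the second account and the error strings are all assumptions made for illustration.

```python
# Hypothetical data: which accounts exist, and which each session may view.
CUSTOMERS = {
    "88741": {"address": "1 Example Street"},
    "10002": {"address": "2 Sample Road"},
}
SESSION_SCOPE = {"sess-a": {"10002"}}


def get_customer(session_id: str, account_id: str) -> dict:
    """Return customer data only if the session is authorised for the account."""
    allowed = SESSION_SCOPE.get(session_id, set())
    # The check uses server-side session state, never anything the model says.
    if account_id not in allowed:
        return {"error": "access_denied"}
    customer = CUSTOMERS.get(account_id)
    if customer is None:
        return {"error": "not_found"}
    return {"account_id": account_id, **customer}
```

Because session_id comes from the application rather than from the model's arguments, no prompt wording can widen the scope.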

Practical decision check

Before giving a model access to tools that have side effects, ask:

Which tools can execute without human review? Separate read-only tools (no approval needed) from write tools (approval required). Treat destructive tools (deletes, irreversible changes) as a further tier that requires explicit human confirmation and a re-check of current state before execution.
Is every write tool idempotent? If the model retries a failed call, does the second call create a duplicate or return the original result? Add idempotency keys to prevent double-execution of side effects.
Are tool permissions scoped per user or per session? Can the model access data belonging to other users, or is access limited to the current session’s authorised scope?
Is there an audit log of every tool call? Record: tool name, arguments (sanitised), session ID, user ID, approval status (auto or reviewed), and result.
Can the model’s tool access be revoked mid-session? If a harmful pattern is detected mid-conversation, can you disable tool execution for the rest of that session without terminating the whole conversation?
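The last question, revoking tool access mid-session, can be handled with a per-session switch that the executor checks before every call. This is a sketch under assumptions: REVOKED_SESSIONS is an in-memory set, and the registry of handlers is passed in by the caller.

```python
REVOKED_SESSIONS = set()  # sessions whose tool execution has been disabled


def revoke_tools(session_id: str) -> None:
    """Disable tool execution for one session; the conversation can continue."""
    REVOKED_SESSIONS.add(session_id)


def execute_tool(session_id: str, tool: str, args: dict, registry: dict) -> dict:
    """Run a registered tool unless the session has been revoked."""
    if session_id in REVOKED_SESSIONS:
        return {"error": "tools_disabled"}
    handler = registry.get(tool)
    if handler is None:
        return {"error": "unknown_tool"}
    return {"result": handler(**args)}
```

Returning a structured error instead of raising lets the model keep answering in text while every further tool call for that session is refused.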

Methodology

Date checked: 2026-05-28
Sources consulted: OWASP Top 10 for LLM Applications, provider function-calling/tool-use documentation (Anthropic, OpenAI), NIST AI RMF, UK NCSC AI security guidance
Assumptions: This is an evergreen concept page. Provider tool-use APIs and safety features evolve — verify against current documentation. The approval gate and idempotency patterns assume a server-side application architecture where the application mediates all tool calls.
Limitations: This guide covers tool-use safety patterns for text-based LLM agents. It does not cover multimodal agent safety, code-execution sandboxing, or browser-automation tool safety. For prompt injection defence, see the separate prompt injection guide.
Jurisdiction: Global. References UK NCSC guidance as an example of standards-based security thinking. No jurisdiction-specific regulatory advice.

Source list

[1] OWASP Top 10 for LLM Applications — https://owasp.org/www-project-top-10-for-large-language-model-applications/ (accessed 2026-05-28)
[2] Anthropic tool-use documentation — https://docs.anthropic.com/ (accessed 2026-05-28)
[3] OpenAI function-calling documentation — https://platform.openai.com/docs/ (accessed 2026-05-28)
[4] NIST AI RMF — https://www.nist.gov/itl/ai-risk-management-framework (accessed 2026-05-28)
[5] UK NCSC AI security guidance — https://www.ncsc.gov.uk/collection/ai-security-and-safety (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: editorial review — corrected writtenBy, removed workflow leaks (brief reference, internal-link suggestions), added 3 Editor’s Note cards, proper Trust Stack, slugified heading IDs, standardized Methodology and Source list
2026-05-27: added direct source URLs; added Change Log section
2026-05-26: first published