Tool-use safety: stopping agents from taking dangerous actions
Tool use is where tidy demos become real risk. Once a model can write a record, send a message, change a setting or trigger a workflow, the safety problem is no longer just the text it emits. It is the side effect that follows.
The safe pattern is boring: validate the request, re-check current state, enforce permissions, add an approval gate for risky actions, and make retries idempotent. If that feels strict, good. The blast radius is smaller for a reason.
A structured tool call is not a safe action. It is only a better-shaped request. The code around it still has to decide whether the action is allowed right now.
Trust stack
AI draft model: gpt-5.4-mini. AI review model: gpt-5.4. Checked against the originating brief and current primary/near-primary sources on 2026-05-24.
What this means
A tool is a function the model can call with typed parameters. The model proposes a call (tool name + arguments), and the application decides whether to execute it. That decision layer — the gate between “the model wants to do this” and “we actually do it” — is where tool-use safety lives.
Designing this gate requires answering: which tool calls execute immediately, which need confirmation, which are blocked, and how does the system recover if a call fails mid-way? The answers are different for a “read weather” tool (low risk, immediate execute), a “send email” tool (medium risk, require confirmation), and a “delete database record” tool (high risk, require human approval and logging).
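One way to picture this gate is a small policy table that maps each tool to a decision. The sketch below is illustrative only: the tool names, the `Decision` tiers and the `decide` function are assumptions made for this example, not any provider's API.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"          # run immediately
    CONFIRM = "confirm"          # ask the user to confirm first
    HUMAN_APPROVAL = "approve"   # hold for a human reviewer and log it
    BLOCK = "block"              # never run


# Hypothetical tool names, mirroring the three examples in the text.
TOOL_POLICY = {
    "read_weather": Decision.EXECUTE,
    "send_email": Decision.CONFIRM,
    "delete_database_record": Decision.HUMAN_APPROVAL,
}


def decide(tool_name: str) -> Decision:
    """Return the gate decision for a proposed tool call.

    Unknown tools are blocked, so a newly added tool stays unusable
    until someone classifies it deliberately.
    """
    return TOOL_POLICY.get(tool_name, Decision.BLOCK)
```

Defaulting to BLOCK is a design choice in this sketch: it fails closed when a tool has not been reviewed yet.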
Where teams misuse it
- No approval gate between “model proposes” and “tool executes”. The model calls a tool and the application executes it without checking whether the action is appropriate for this user, this context, or this point in the workflow. A model that proposes “send email to all users” should not have that call executed until a human has confirmed the recipient list.
- Idempotency as an afterthought. If a tool call fails and the model retries, does the retry create a duplicate? For a “create ticket” tool, a first call that times out after the ticket was already created, followed by a successful retry, leaves two tickets. The tool needs to be idempotent: repeated calls with the same input return the result of the first successful call instead of repeating the side effect.
- Granting write access when read-only would suffice. A model that summarises support tickets is given a tool that can also delete or modify tickets. The developer thought “the model might need to update ticket status” but never tested what happens when the model interprets a user request as “close this ticket and mark as resolved” rather than “summarise this ticket.”
- Failing to scope tool access per user or per session. A tool that checks customer account balance should not be able to check any account. It should check only the accounts the current user is authorised to see. Scoping happens at the application layer, not in the model’s tool description.
- No audit trail for tool calls. When a tool executes a side effect, the only record is the model’s response. Nobody can later answer: what tool was called, with what arguments, by which session, and was it approved or automatic?
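The write-access point can be handled before the model ever sees a tool: expose only the tools a task needs. The following is a minimal sketch under stated assumptions. The tool catalogue, the task names and `tools_for_task` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool  # True if the tool has side effects


# Hypothetical catalogue for a ticketing integration.
CATALOGUE = [
    Tool("get_ticket", writes=False),
    Tool("list_tickets", writes=False),
    Tool("update_ticket_status", writes=True),
    Tool("delete_ticket", writes=True),
]

# Assumption: only these tasks are allowed to see write tools at all.
WRITE_TASKS = {"triage"}


def tools_for_task(task: str) -> list[str]:
    """Return the tool names to expose to the model for a given task.

    Tasks not listed in WRITE_TASKS get the read-only subset, so a
    summarisation task cannot close or delete a ticket by accident.
    """
    allow_writes = task in WRITE_TASKS
    return [t.name for t in CATALOGUE if allow_writes or not t.writes]
```

A model that is never offered `delete_ticket` cannot propose it, which is a stronger guarantee than an instruction telling it not to.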
Real scenarios: approval gate patterns
Scenario A: “Send email” tool
A model has a send_email(to, subject, body) tool. Without an approval gate, a user could ask “send this to twenty customers” and the model would execute twenty API calls. The fix: define send_email as a review-required tool. The model proposes the call with all parameters. The application holds it in a pending state and presents it to a human reviewer (via a dashboard, Slack notification, or inline confirmation), who confirms or rejects before the email is dispatched. This is especially important for bulk, financial, or legal communications.
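The pending state in Scenario A can be sketched as a small in-memory queue. This is an illustration, not a production design: `ApprovalQueue`, `PendingCall` and the executor mapping are hypothetical names, and a real system would persist pending calls and authenticate the reviewer.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PendingCall:
    tool: str
    args: dict
    status: str = "pending"  # pending -> approved | rejected
    call_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ApprovalQueue:
    def __init__(self, executors):
        self._executors = executors  # tool name -> callable
        self._pending = {}

    def propose(self, tool, args):
        """Store the model's proposed call. Nothing is executed here."""
        call = PendingCall(tool, dict(args))
        self._pending[call.call_id] = call
        return call.call_id

    def review(self, call_id, approve):
        """Apply a human decision. The tool runs only on approval."""
        # pop() raises KeyError for unknown or already-reviewed calls,
        # so the same proposal cannot be approved twice.
        call = self._pending.pop(call_id)
        if not approve:
            call.status = "rejected"
            return None
        call.status = "approved"
        return self._executors[call.tool](**call.args)
```

The model only ever reaches `propose`. The path to `review` belongs to the dashboard or confirmation UI, so no model output can approve its own call.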
Scenario B: “Create ticket” tool
A model has a create_ticket(summary, priority, assignee) tool. The first call times out from the caller's side, even though the backend had already created the ticket. The model retries, and the second call succeeds. Now there are two tickets. The fix: make the tool idempotent by including a client-generated idempotency key (e.g. a hash of the conversation ID + call index) that stays the same across retries of that call. The tool checks: “has this key been used before?” If yes, it returns the existing result instead of creating a duplicate. This is the same pattern payment APIs use to prevent double charges.
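A minimal sketch of the idempotency-key pattern from Scenario B follows. `TicketService` is a hypothetical in-memory stand-in for a ticketing backend, and the key derivation (SHA-256 over conversation ID and call index) is one possible choice. In a real backend the check-and-insert step would need to be atomic, for example through a unique constraint on the key.

```python
import hashlib


def idempotency_key(conversation_id: str, call_index: int) -> str:
    """Derive a stable key; a retry of the same call reuses the same index."""
    raw = f"{conversation_id}:{call_index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class TicketService:
    """In-memory stand-in for a ticketing backend (hypothetical)."""

    def __init__(self):
        self._by_key = {}
        self.tickets = []

    def create_ticket(self, key, summary, priority, assignee):
        # A key that was already used returns the earlier ticket
        # instead of creating a second one.
        if key in self._by_key:
            return self._by_key[key]
        ticket = {
            "id": len(self.tickets) + 1,
            "summary": summary,
            "priority": priority,
            "assignee": assignee,
        }
        self.tickets.append(ticket)
        self._by_key[key] = ticket
        return ticket
```

The key is generated by the application, not by the model, so a retry cannot accidentally change it.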
Scenario C: “Read customer data” tool
A model has a get_customer(account_id) tool. Without access scoping, a user could ask “what is customer 88741’s address?” and the model would retrieve it even if the user has no relationship to that account. The fix: the application intercepts the tool call and checks that account_id belongs to a customer the current session is authorised to view. If not, the tool returns an access-denied response that the model cannot override. Scoping is enforced by the application, not by a system-prompt instruction.
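The interception step in Scenario C might look like the sketch below. The `Session` type, the `CUSTOMERS` store and the error payloads are assumptions for illustration. The point is only that the authorisation check runs in application code, using session data the model cannot edit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    user_id: str
    allowed_account_ids: frozenset  # set by the application at login


# Hypothetical data store.
CUSTOMERS = {
    "88741": {"address": "1 Example St"},
    "10001": {"address": "2 Sample Rd"},
}


def get_customer(session: Session, account_id: str) -> dict:
    """Run the tool only if the session is authorised for this account."""
    if account_id not in session.allowed_account_ids:
        # This becomes the tool result the model sees; no data leaks out.
        return {"error": "access_denied"}
    record = CUSTOMERS.get(account_id)
    if record is None:
        return {"error": "not_found"}
    return {"account_id": account_id, **record}
```

The authorisation check runs before the lookup, so an unauthorised caller gets the same `access_denied` response whether or not the account exists.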
Practical decision check
Before giving a model access to tools that have side effects, ask:
- Which tools can execute without human review? Separate read-only tools (no approval needed) from write tools (approval required). Treat destructive tools (deletes, irreversible changes) as a further tier that requires explicit human confirmation and a re-check of current state.
- Is every write tool idempotent? If the model retries a failed call, does the second call create a duplicate or return the original result? Add idempotency keys to prevent double-execution of side effects.
- Are tool permissions scoped per user or per session? Can the model access data belonging to other users, or is access limited to the current session’s authorised scope?
- Is there an audit log of every tool call? Record: tool name, arguments (sanitised), session ID, user ID, approval status (auto or reviewed), and result.
- Can the model’s tool access be revoked mid-session? If a harmful pattern is detected mid-conversation, can you disable tool execution for the rest of that session without terminating the whole conversation?
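The last two checks, audit logging and mid-session revocation, can share one wrapper around tool execution. The sketch below is an assumption-laden illustration: `ToolRunner`, the redacted field names and the log shape are hypothetical, and the in-memory list stands in for an append-only store.

```python
import json
import time


class ToolRunner:
    def __init__(self, executors, redact=("password", "token")):
        self._executors = executors       # tool name -> callable
        self._redact = set(redact)        # argument names never logged in clear
        self._disabled_sessions = set()
        self.audit_log = []               # stand-in for an append-only store

    def disable(self, session_id):
        """Stop executing tools for this session; the chat can continue."""
        self._disabled_sessions.add(session_id)

    def run(self, session_id, user_id, tool, args, approval="auto"):
        if session_id in self._disabled_sessions:
            result = {"error": "tools_disabled"}
        else:
            result = self._executors[tool](**args)
        safe_args = {
            k: ("[redacted]" if k in self._redact else v)
            for k, v in args.items()
        }
        # Every attempt is logged, including ones refused by the kill switch.
        self.audit_log.append(json.dumps({
            "ts": time.time(),
            "tool": tool,
            "args": safe_args,
            "session_id": session_id,
            "user_id": user_id,
            "approval": approval,
            "result": result,
        }, default=str))
        return result
```

Because the kill switch lives in the runner and not in the prompt, disabling a session takes effect on the next tool call regardless of what the model has been told.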
Evidence and caveats
- Originating brief: 063-tool-use-safety-stopping-agents-from-taking-dangerous-actions.md
- Check date: 2026-05-24
- This draft uses current primary or near-primary sources only for the gap-fill citations requested by the brief.
- No hands-on product claim is made unless the source path is explicit in the text.
- If provider policy, retention, tool-use or citation docs change, this page should be re-checked before promotion.
Source and evidence notes
- OWASP Top 10 for LLM Applications — https://owasp.org/www-project-top-10-for-large-language-model-applications/
- Provider function-calling / tool-use docs — https://docs.anthropic.com/ and https://platform.openai.com/docs/
- NIST AI RMF — https://www.nist.gov/itl/ai-risk-management-framework
- UK NCSC AI security guidance — https://www.ncsc.gov.uk/collection/ai-security-and-safety
Internal-link suggestions
- /run/function-calling-and-tool-use-where-agents-actually-fail/
- /run/jailbreaks-vs-product-safety-what-operators-can-realistically-control/
- /run/red-teaming-an-llm-feature-a-practical-first-week-checklist/
Methodology
What was checked: originating brief plus current provider/standards documentation relevant to the topic.
What the sources were used for:
- to keep the claims cautious and specific;
- to date the guidance where policy or operational details can move;
- to avoid turning source notes into marketing copy.
Assumptions and limits:
- This is an evergreen concept page, not a benchmark report.
- No launch, outreach, affiliate, payment or tracking changes are implied.
- The draft is public-clean and omits internal ticket IDs by design.
Related guides
- prompt injection explained for business users
- jailbreaks vs product safety what operators can realistically control
- ai incident response what to do when a model gives harmful or wrong advice
- ai output monitoring what to log sample and review
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.