theLLMs

Last checked: 2026-05-24

Scope: Global. Provider and standards sources checked as of 2026-05-24.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

Tool-use safety: stopping agents from taking dangerous actions

Tool use is where tidy demos become real risk. Once a model can write a record, send a message, change a setting or trigger a workflow, the safety problem is no longer just the text it emits. It is the side effect that follows.

The safe pattern is boring: validate the request, re-check current state, enforce permissions, add an approval gate for risky actions, and make retries idempotent. If that feels strict, good. The blast radius is smaller for a reason.

A structured tool call is not a safe action. It is only a better-shaped request. The code around it still has to decide whether the action is allowed right now.
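To make the "better-shaped request" point concrete, here is a minimal sketch of the first step only: checking that a proposed create_ticket call is well-formed. The ProposedCall type, the validate_create_ticket helper and the allowed priority values are assumptions for illustration, not part of any provider API. Passing this check says nothing about whether the action is permitted; permission, state and approval checks are separate steps.

```python
from dataclasses import dataclass

# Assumed set of valid priorities; a real system would define its own.
ALLOWED_PRIORITIES = {"low", "medium", "high"}
EXPECTED_ARGS = {"summary", "priority", "assignee"}


@dataclass(frozen=True)
class ProposedCall:
    """What the model emits: a tool name plus arguments. Nothing has run yet."""
    tool: str
    args: dict


def validate_create_ticket(call: ProposedCall) -> list:
    """Return a list of problems; an empty list means the request is well-formed."""
    problems = []
    if call.tool != "create_ticket":
        problems.append(f"unexpected tool: {call.tool}")
    extra = set(call.args) - EXPECTED_ARGS
    missing = EXPECTED_ARGS - set(call.args)
    if extra:
        problems.append(f"unknown arguments: {sorted(extra)}")
    if missing:
        problems.append(f"missing arguments: {sorted(missing)}")
    summary = call.args.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("summary must be a non-empty string")
    priority = call.args.get("priority")
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        problems.append("priority must be one of low, medium, high")
    return problems
```

An empty list from this function is the start of the decision, not the end of it: the caller still has to ask whether this user may create this ticket right now.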

Trust stack

AI draft model: gpt-5.4-mini. AI review model: gpt-5.4. Checked against the originating brief and current primary/near-primary sources on 2026-05-24.

What this means

A tool is a function the model can call with typed parameters. The model proposes a call (tool name + arguments), and the application decides whether to execute it. That decision layer — the gate between “the model wants to do this” and “we actually do it” — is where tool-use safety lives.

Designing this gate means answering four questions: which tool calls execute immediately, which need confirmation, which are blocked, and how the system recovers if a call fails midway. The answers differ for a “read weather” tool (low risk, execute immediately), a “send email” tool (medium risk, require confirmation), and a “delete database record” tool (high risk, require human approval and logging).
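One way to write those answers down is a small policy table that the application consults before executing anything. The sketch below is illustrative: the tool names mirror the three examples above, and the Decision enum and decide function are hypothetical names, not a provider API.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"                # low risk: run immediately
    CONFIRM = "confirm"                # medium risk: ask for confirmation first
    HUMAN_APPROVAL = "human_approval"  # high risk: hold for a reviewer and log
    BLOCK = "block"                    # never run


# Hypothetical tool names, one per risk tier described in the text.
POLICY = {
    "read_weather": Decision.EXECUTE,
    "send_email": Decision.CONFIRM,
    "delete_database_record": Decision.HUMAN_APPROVAL,
}


def decide(tool_name: str) -> Decision:
    """Look up the tier for a tool. Unknown tools are blocked, not executed."""
    return POLICY.get(tool_name, Decision.BLOCK)
```

The default-deny lookup is the design choice worth keeping: a tool that nobody classified should fail closed.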

Where teams misuse it

  • No approval gate between “model proposes” and “tool executes”. The model calls a tool and the application executes it without checking whether the action is appropriate for this user, this context, or this point in the workflow. When a model proposes “send email to all users”, the application should not execute that call until a human has confirmed the recipient list.

  • Idempotency as an afterthought. If a tool call fails and the model retries, does the retry create a duplicate? For a “create ticket” tool, a first call that times out after the ticket was actually created, followed by a successful retry, leaves two tickets. The tool needs to be idempotent: repeat calls with the same input produce the same result as the first successful call.

  • Granting write access when read-only would suffice. A model that summarises support tickets is given a tool that can also delete or modify tickets. The developer thought “the model might need to update ticket status” but never tested what happens when the model interprets a user request as “close this ticket and mark as resolved” rather than “summarise this ticket.”

  • Failing to scope tool access per user or per session. A tool that checks a customer account balance should not be able to check any account. It should check only the accounts the current user is authorised to see. Scoping happens at the application layer, not in the model’s tool description.

  • No audit trail for tool calls. When a tool executes a side effect, the only record is the model’s response. Nobody can later answer: what tool was called, with what arguments, by which session, and was it approved or automatic?
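The audit-trail gap in the last item can be closed with one structured log line per tool call. This is a minimal sketch: the field names, the audit_record helper and the list of redacted argument names are assumptions for illustration, and a real deployment would write the line to durable storage.

```python
import json
from datetime import datetime, timezone

# Assumed argument names to redact before logging; adjust per tool.
SENSITIVE_KEYS = {"body", "password", "token"}


def audit_record(tool, args, session_id, user_id, approval, outcome):
    """Build one JSON line describing a tool call.

    approval is "auto" (executed without review) or "reviewed".
    """
    if approval not in ("auto", "reviewed"):
        raise ValueError("approval must be 'auto' or 'reviewed'")
    sanitised = {
        key: "[redacted]" if key in SENSITIVE_KEYS else value
        for key, value in args.items()
    }
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "args": sanitised,
        "session_id": session_id,
        "user_id": user_id,
        "approval": approval,
        "outcome": outcome,
    }
    # default=str keeps logging from failing on non-JSON values such as dates.
    return json.dumps(entry, sort_keys=True, default=str)
```

Each line answers the four questions directly: which tool, with what arguments, in which session, and whether it was approved or automatic.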

Real scenarios: approval gate patterns

Scenario A: “Send email” tool

A model has a send_email(to, subject, body) tool. Without an approval gate, a user could ask “send this to twenty customers” and the application would execute twenty API calls with no review. The fix: define send_email as a review-required tool. The model proposes the call with all parameters. The application holds it in a pending state and presents it to a human reviewer (via a dashboard, Slack notification, or inline confirmation), who confirms or rejects before the email is dispatched. This is especially important for bulk, financial, or legal communications.
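The pending state in Scenario A can be sketched as a small in-memory queue. ApprovalQueue, PendingCall and the executor callable are hypothetical names; a production version would persist pending calls and record who reviewed them.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PendingCall:
    tool: str
    args: dict
    status: str = "pending"  # pending -> approved | rejected
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ApprovalQueue:
    def __init__(self, executor):
        # executor(tool, args) performs the real side effect, e.g. sends the email.
        self._executor = executor
        self._pending = {}

    def propose(self, tool, args):
        """Store the model's proposal without executing it; return its id."""
        # Copy the arguments so the reviewed call cannot be mutated afterwards.
        call = PendingCall(tool, dict(args))
        self._pending[call.id] = call
        return call.id

    def review(self, call_id, approve):
        """Execute on approval, drop on rejection. Each call is reviewed once."""
        call = self._pending.pop(call_id)  # KeyError if unknown or already reviewed
        if not approve:
            call.status = "rejected"
            return None
        call.status = "approved"
        return self._executor(call.tool, call.args)
```

Nothing reaches the executor from propose, so the only path to a sent email runs through review with approve set to True.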

Scenario B: “Create ticket” tool

A model has a create_ticket(summary, priority, assignee) tool. The first call times out on the client, although the server did create the ticket. The model retries, and the second call succeeds. Now there are two tickets. The fix: make the tool idempotent by including a client-generated idempotency key that stays the same across retries of the same logical call (e.g. a hash of the conversation ID + call index, where the index identifies the call and not the attempt). The tool checks: “has this key been used before?” If yes, it returns the existing result instead of creating a duplicate. This is the same pattern payment APIs use to prevent double charges.
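A minimal sketch of Scenario B, assuming an in-memory store. TicketStore and idempotency_key are hypothetical names, and the dictionary lookup stands in for what would normally be a unique constraint in a database; this version is not safe under concurrent requests.

```python
import hashlib


def idempotency_key(conversation_id: str, call_index: int) -> str:
    """Derive a key that is stable across retries of the same logical call.

    call_index must identify the call, not the attempt, so a retry reuses it.
    """
    raw = f"{conversation_id}:{call_index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class TicketStore:
    def __init__(self):
        self._by_key = {}  # idempotency key -> ticket already created
        self._next_id = 1

    def create_ticket(self, key, summary, priority, assignee):
        # A key seen before means this is a retry: return the original ticket.
        if key in self._by_key:
            return self._by_key[key]
        ticket = {
            "id": self._next_id,
            "summary": summary,
            "priority": priority,
            "assignee": assignee,
        }
        self._next_id += 1
        self._by_key[key] = ticket
        return ticket
```

With this in place, a retry after a timeout returns the ticket from the first attempt, and only a call with a new key creates a new one.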

Scenario C: “Read customer data” tool

A model has a get_customer(account_id) tool. Without access scoping, a user could ask “what is customer 88741’s address?” and the model would retrieve it even if the user has no relationship to that account. The fix: the application intercepts the tool call and checks that account_id belongs to a customer the current session is authorised to view. If not, the tool returns an access-denied response that the model cannot override. Scoping is enforced by the application, not by a system-prompt instruction.
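Scenario C comes down to a check the application runs before the lookup. In this sketch the Session type, the CUSTOMERS dictionary and its contents are made-up stand-ins for a real session object and data store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    user_id: str
    allowed_account_ids: frozenset  # accounts this session may view


# Stand-in data store with invented records, for illustration only.
CUSTOMERS = {
    "88741": {"name": "Example Customer", "address": "1 Example Street"},
    "10001": {"name": "Another Customer", "address": "2 Example Street"},
}


def get_customer_scoped(session: Session, account_id) -> dict:
    """Run the scope check before touching the data store."""
    if not isinstance(account_id, str) or account_id not in session.allowed_account_ids:
        # The model only ever sees this denial; no prompt text can change it.
        return {"error": "access_denied"}
    record = CUSTOMERS.get(account_id)
    if record is None:
        return {"error": "not_found"}
    return {"customer": record}
```

Because the scope check comes first, a session asking about an account outside its scope gets the same access_denied answer whether or not that account exists.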

Practical decision check

Before giving a model access to tools that have side effects, ask:

  • Which tools can execute without human review? Separate read-only tools (no approval needed) from write tools (approval required). Treat destructive tools (deletes, irreversible changes) as a further tier that requires explicit human confirmation and a re-check of current state.

  • Is every write tool idempotent? If the model retries a failed call, does the second call create a duplicate or return the original result? Add idempotency keys to prevent double-execution of side effects.

  • Are tool permissions scoped per user or per session? Can the model access data belonging to other users, or is access limited to the current session’s authorised scope?

  • Is there an audit log of every tool call? Record: tool name, arguments (sanitised), session ID, user ID, approval status (auto or reviewed), and result.

  • Can the model’s tool access be revoked mid-session? If a harmful pattern is detected mid-conversation, can you disable tool execution for the rest of that session without terminating the whole conversation?
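The last question, revoking tool access mid-session, can be sketched as a per-session switch checked on every call. ToolSwitch and its method names are hypothetical; the point is that the conversation continues while side effects stop.

```python
class ToolSwitch:
    """Per-session kill switch for tool execution."""

    def __init__(self):
        self._revoked = set()  # session ids whose tool access is disabled

    def revoke(self, session_id):
        self._revoked.add(session_id)

    def run(self, session_id, tool, args, executor):
        # Checked on every call, so revocation takes effect immediately.
        if session_id in self._revoked:
            return {"error": "tools_disabled"}
        return {"result": executor(tool, args)}
```

Returning a structured error instead of raising lets the model keep replying in text, for example to tell the user that actions are unavailable for the rest of the session.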

Evidence and caveats

  • Originating brief: 063-tool-use-safety-stopping-agents-from-taking-dangerous-actions.md
  • Check date: 2026-05-24
  • This draft uses current primary or near-primary sources only for the gap-fill citations requested by the brief.
  • No hands-on product claim is made unless the source path is explicit in the text.
  • If provider policy, retention, tool-use or citation docs change, this page should be re-checked before promotion.

Source and evidence notes

  • /run/function-calling-and-tool-use-where-agents-actually-fail/
  • /run/jailbreaks-vs-product-safety-what-operators-can-realistically-control/
  • /run/red-teaming-an-llm-feature-a-practical-first-week-checklist/

Methodology

What was checked: originating brief plus current provider/standards documentation relevant to the topic.

What the sources were used for:

  • to keep the claims cautious and specific;
  • to date the guidance where policy or operational details can move;
  • to avoid turning source notes into marketing copy.

Assumptions and limits:

  • This is an evergreen concept page, not a benchmark report.
  • No launch, outreach, affiliate, payment or tracking changes are implied.
  • The draft is public-clean and omits internal ticket IDs by design.

Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.