theLLMs

Last checked: 2026-05-24

Scope: Global. Provider and standards sources checked as of 2026-05-24.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

AI incident response: what to do when a model gives harmful or wrong advice

When an AI feature gives bad advice, the incident is not just the answer. It is the path that let the answer reach a user, and the path that failed to stop it in time.

The right playbook depends on severity. A wrong suggestion is not the same as a safety-critical mistake, but both need a clear path for containment and review.

Trust stack

AI draft model: gpt-5.4-mini. AI review model: gpt-5.4. Checked against the originating brief and current primary/near-primary sources on 2026-05-24.

Quick answer

Treat harmful outputs like product incidents: contain, preserve evidence, roll back or disable the risky path, notify the right people, and only then ask how the model got there.

What this means

AI incident response follows the same structure as any product incident (contain, triage, fix, learn), with one extra step: separate the model output from the product path that delivered it. The output was wrong, but was the failure in the model, the prompt, the retrieved context, the tool permission, the approval gate, or the output validation? The answer determines whether you fix the system prompt, add a guardrail, update the RAG source, or disable the tool.
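One way to make that separation concrete is to record a finding per layer and look at the product-path layers before the model. The sketch below is illustrative only: the `Layer` names mirror the layers listed above, while the ordering, the `REMEDIATION` pairing, and the fallback to the model when no other layer is implicated are assumptions, not a prescribed mapping.

```python
from enum import Enum


class Layer(Enum):
    # Declared in the order they are examined: product-path layers first,
    # the model itself last (an assumed ordering for this sketch).
    RETRIEVED_CONTEXT = "retrieved context"
    TOOL_PERMISSION = "tool permission"
    APPROVAL_GATE = "approval gate"
    OUTPUT_VALIDATION = "output validation"
    PROMPT = "prompt"
    MODEL = "model"


# Hypothetical pairing of each layer with the first fix to consider.
REMEDIATION = {
    Layer.RETRIEVED_CONTEXT: "update the RAG source",
    Layer.TOOL_PERMISSION: "disable the tool",
    Layer.APPROVAL_GATE: "add a human review gate",
    Layer.OUTPUT_VALIDATION: "add an output validator",
    Layer.PROMPT: "fix the system prompt",
    Layer.MODEL: "roll back the model version",
}


def first_failed_layer(findings: dict) -> Layer:
    """Return the first layer flagged as failed, in declaration order.

    `findings` maps Layer -> bool. If no layer is flagged, the model is
    what remains, so it is returned as the fallback.
    """
    for layer in Layer:
        if findings.get(layer, False):
            return layer
    return Layer.MODEL


if __name__ == "__main__":
    findings = {Layer.RETRIEVED_CONTEXT: True, Layer.MODEL: True}
    layer = first_failed_layer(findings)
    print(layer.value, "->", REMEDIATION[layer])
```

In practice more than one layer can fail at once; returning only the first is a simplification to keep the triage order visible.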

Incident severity determines the response speed and the escalation path. A P1 incident (harmful output reaching a wide audience, a safety-critical error, data exposure) needs immediate containment: disable the feature or route. A P3 incident (wrong advice that is annoying but not harmful, a formatting error on a low-traffic surface) can be queued for the next sprint. Teams that skip severity classification tend to treat every model mistake as urgent, which means nothing is actually urgent.

Where teams misuse it

  • No severity ladder for AI incidents. Without a severity classification, a model generating a slightly wrong product description gets the same response as a model generating harmful medical advice. Both get escalated, both get a post-mortem, and the team wears itself out on the low-severity incidents while the high-severity ones get lost in the noise.

  • Fixing the prompt but leaving the tool path open. A model generated an incorrect refund amount. The team updates the system prompt to say “always verify refund amounts.” But the tool that executes refunds has no approval gate and no upper limit check. The prompt fix makes the failure less likely but does not fix the product: the same failure can recur with a differently phrased user request.

  • Skipping evidence preservation before rolling back. When a harmful output is detected, the team rolls back the model version or disables the feature immediately, which is the right instinct. But they do so without first preserving the prompt, context, tool calls, and output that caused the incident. When the post-mortem starts, nobody has the data to understand what happened.

  • Treating every incident as a model problem. A model gave a wrong answer. The investigation stops at “the model hallucinated.” But the real failure was in retrieval (the wrong document was returned), tool permissions (the model had write access it should not have had), or output validation (the check that should have caught the wrong fact was missing). Model behaviour is the last thing to blame, not the first.
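The refund example above can be made concrete with a check that lives in the tool path rather than in the prompt. This is a minimal sketch under stated assumptions: `RefundRequest`, `check_refund`, and the `MAX_AUTO_REFUND` ceiling are hypothetical names and values chosen for illustration, not part of any real payment API.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Hypothetical ceiling above which a human must approve the refund.
MAX_AUTO_REFUND = Decimal("50.00")


class RefundRejected(Exception):
    """Raised when a refund request fails a hard limit check."""


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount: Decimal
    order_total: Decimal
    approved_by: Optional[str] = None  # set once a human has approved


def check_refund(req: RefundRequest) -> str:
    """Decide what to do with a model-proposed refund.

    Returns "execute" or "needs_approval", and raises RefundRejected when
    the amount breaks a hard limit. The check runs regardless of how the
    user request or the system prompt was phrased.
    """
    if req.amount <= 0:
        raise RefundRejected("refund amount must be positive")
    if req.amount > req.order_total:
        raise RefundRejected("refund exceeds the order total")
    if req.amount > MAX_AUTO_REFUND and req.approved_by is None:
        return "needs_approval"
    return "execute"


if __name__ == "__main__":
    small = RefundRequest("order-1", Decimal("10.00"), Decimal("20.00"))
    large = RefundRequest("order-2", Decimal("80.00"), Decimal("100.00"))
    print(check_refund(small))
    print(check_refund(large))
```

Because the limit sits in code that wraps the tool, a differently phrased request cannot talk its way past it, which is the gap a prompt-only fix leaves open.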

Severity ladder reference

A practical severity classification for AI incidents:

  • P1 (Critical): The model generated output that caused or could cause real harm (for example, safety, medical, financial, or legal advice); PII was exposed to unauthorised users; the model executed a destructive tool call without approval. Response: disable the feature or route immediately, preserve evidence, notify stakeholders within 1 hour, full post-mortem within 48 hours.

  • P2 — High: The model generated consistently wrong factual answers in a customer-facing surface; a minor policy violation in output; a tool call that caused a non-destructive but visible side effect (wrong ticket created, wrong email sent). Response: disable if practical, otherwise add a warning banner or output filter; notify within 4 hours; post-mortem within 1 week.

  • P3 — Moderate: The model generated an incorrect answer in an internal tool; a formatting issue; a single wrong fact that the user can reasonably disregard. Response: queue fix for next sprint; log the incident; review in next team triage.

  • P4 — Low: The model was less helpful than expected; a response was verbose; a suggestion was technically correct but not useful. Response: log as eval feedback; no incident process needed.
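The ladder is easier to apply consistently when it exists as data rather than only as prose. The sketch below encodes the response targets listed above; the class and function names (`Severity`, `ResponsePolicy`, `deadlines`) are assumptions made for illustration, and the time windows are copied from the ladder, not from any external standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    P1 = 1  # critical
    P2 = 2  # high
    P3 = 3  # moderate
    P4 = 4  # low


@dataclass(frozen=True)
class ResponsePolicy:
    containment: str
    notify_within: Optional[timedelta]      # None means no notification deadline
    postmortem_within: Optional[timedelta]  # None means no post-mortem deadline


POLICIES = {
    Severity.P1: ResponsePolicy(
        "disable the feature or route immediately; preserve evidence",
        timedelta(hours=1), timedelta(hours=48)),
    Severity.P2: ResponsePolicy(
        "disable if practical, otherwise add a warning banner or output filter",
        timedelta(hours=4), timedelta(weeks=1)),
    Severity.P3: ResponsePolicy(
        "queue fix for next sprint; log the incident", None, None),
    Severity.P4: ResponsePolicy(
        "log as eval feedback; no incident process needed", None, None),
}


def _due(start: datetime, window: Optional[timedelta]) -> Optional[datetime]:
    return None if window is None else start + window


def deadlines(severity: Severity, detected_at: datetime) -> dict:
    """Turn a severity class and a detection time into concrete due times."""
    policy = POLICIES[severity]
    return {
        "containment": policy.containment,
        "notify_by": _due(detected_at, policy.notify_within),
        "postmortem_by": _due(detected_at, policy.postmortem_within),
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for name, value in deadlines(Severity.P1, now).items():
        print(name, value)
```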

Practical decision check

When an AI failure is reported, ask:

  • What severity class does this belong to? Use the ladder above. A wrong answer to “how do I cancel my account?” is different from a wrong answer to “what medication should I take?”

  • What evidence needs to be preserved before anything changes? Capture the exact prompt, retrieved context, tool calls (if any), model response, timestamp, model version, and deployment ID. Preserve before rolling back.

  • What part of the stack failed? Model behaviour, retrieval quality, prompt design, tool permission, approval gate, output validation, or user context? Fix the layer that failed, not the nearest symptom.

  • Can the risky path be disabled without disabling the whole feature? If a specific tool or retrieval source caused the failure, disable that path rather than turning off the entire AI surface.

  • What would stop the same failure from recurring? A prompt fix alone is rarely enough. Add an automated guardrail, a human review gate, a retrieval boundary, or a model-agnostic output validator.
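Two of the checks above, preserving evidence before anything changes and disabling only the risky path, can be sketched as a single containment step. This is an illustration under stated assumptions: the `IncidentEvidence` fields follow the list in the evidence question, while the JSON-on-disk storage, the `enabled_paths` set standing in for a feature-flag system, and all function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class IncidentEvidence:
    incident_id: str
    prompt: str
    retrieved_context: list
    tool_calls: list
    model_response: str
    timestamp: str       # ISO 8601 string, recorded at detection time
    model_version: str
    deployment_id: str


def preserve_evidence(evidence: IncidentEvidence, directory: Path) -> Path:
    """Write the evidence to a new JSON file and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{evidence.incident_id}.json"
    # Mode "x" fails if the file already exists, so an earlier snapshot
    # is never silently overwritten.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(asdict(evidence), fh, indent=2)
    return path


def contain(evidence: IncidentEvidence, directory: Path,
            enabled_paths: set, risky_path: str) -> Path:
    """Preserve evidence first, then disable only the risky path."""
    saved = preserve_evidence(evidence, directory)  # step 1: snapshot
    enabled_paths.discard(risky_path)               # step 2: narrow disable
    return saved
```

If writing the snapshot fails, the exception propagates before the path is disabled. That keeps the ordering strict, but for a P1 incident a team may reasonably prefer to disable the path anyway and retry the capture, which is a design choice this sketch does not make.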

Evidence and caveats

  • Originating brief: 069-ai-incident-response-what-to-do-when-a-model-gives-harmful-or-wrong-advice.md
  • Check date: 2026-05-24
  • This draft uses current primary or near-primary sources only for the gap-fill citations requested by the brief.
  • No hands-on product claim is made unless the source path is explicit in the text.
  • If provider policy, retention, tool-use or citation docs change, this page should be re-checked before promotion.

Source and evidence notes

  • /run/jailbreaks-vs-product-safety-what-operators-can-realistically-control/
  • /run/tool-use-safety-stopping-agents-from-taking-dangerous-actions/
  • /run/ai-output-monitoring-what-to-log-sample-and-review/

Methodology

What was checked: originating brief plus current provider/standards documentation relevant to the topic.

What the sources were used for:

  • to keep the claims cautious and specific;
  • to date the guidance where policy or operational details can move;
  • to avoid turning source notes into marketing copy.

Assumptions and limits:

  • This is an evergreen concept page, not a benchmark report.
  • No launch, outreach, affiliate, payment or tracking changes are implied.
  • The draft is public-clean and omits internal ticket IDs by design.

Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.