theLLMs

Last checked: 2026-05-25

Scope: Global. Sources checked as of 2026-05-24.

AI draft model: llm-author

AI review model: llm-editor (deepseek-v4-pro)

LLM observability cost: logs, traces and evaluation storage

Short answer: Observability for LLM apps typically costs 5–15% of your inference budget in storage, traces and evaluation compute, and that cost grows with prompt size, conversation depth, and retention period. Sampling and redaction are not optional; they are budget controls.

What it means

Every LLM call you monitor generates data to store: the full input prompt (often 2K–50K tokens), the model output, and metadata such as timestamps, model version, latency, token counts, and any tool calls or validation results. The main cost is usually not the observability vendor subscription but the storage, processing and evaluation compute on the data you keep.

A single call in a long conversation can generate far more observability data than the response it returns:

  • Input trace: system prompt + conversation history + retrieved chunks = 10K–50K tokens stored per call
  • Output trace: generated response + tool-call arguments = 1K–10K tokens
  • Metadata: timestamps, model IDs, latency splits, token counts, error flags
  • Evaluations: running a second model to grade the output adds its own token cost
  • Retention: 30-day vs 90-day vs perpetual storage multiplies everything

If you log every request at full prompt depth across 10,000 calls/day, you write roughly 10–50 GB of uncompressed prompt text per month (at a few bytes per token). With 90-day retention, about three months of that volume is held at any one time, plus evaluation results and metadata.
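The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming about 4 bytes of uncompressed text per token (an assumption, not a figure from the sources); the function name `prompt_storage_gb` is illustrative only.

```python
BYTES_PER_TOKEN = 4  # assumption: ~4 bytes of uncompressed text per token


def prompt_storage_gb(calls_per_day: int, tokens_per_call: int, days: int) -> float:
    """Uncompressed prompt text, in GB, written over `days` days."""
    total_bytes = calls_per_day * tokens_per_call * BYTES_PER_TOKEN * days
    return total_bytes / 1e9


if __name__ == "__main__":
    for tokens in (10_000, 50_000):
        monthly = prompt_storage_gb(10_000, tokens, days=30)
        retained = prompt_storage_gb(10_000, tokens, days=90)
        print(f"{tokens} tokens/call: {monthly:.0f} GB per month, "
              f"{retained:.0f} GB held with 90-day retention")
```

Under these assumptions, 10,000 calls/day at 10K–50K tokens per call comes to 12–60 GB per month and 36–180 GB held at 90 days, before compression, metadata and evaluation results.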

Where teams misuse it

“We use an observability tool, so we’re covered.” Tools capture what you ask them to capture. By default, many log the full prompt and output for every request — exactly what you don’t want for PII-heavy workloads, and exactly the most expensive option for storage.

“We evaluate every output.” Running LLM-as-a-judge on every response adds a second model call per request, which can roughly double your inference cost. If each eval costs $0.002 and you run 50,000/day, that’s $100/day just to grade your outputs. Sampling (e.g., 1% of traffic, all failures, all safety-refused queries) is often more informative and an order of magnitude cheaper.
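To compare full evaluation with sampled evaluation, the arithmetic can be sketched as below. The $0.002 per eval, 50,000 calls/day and 1% sample come from the paragraph above; the 2% failure fraction is a made-up illustration, and `daily_eval_cost` is a hypothetical helper.

```python
def daily_eval_cost(calls_per_day: int, cost_per_eval: float,
                    happy_path_rate: float, failure_fraction: float) -> float:
    """Daily judge spend when every failure and a sample of the rest is graded."""
    failures = calls_per_day * failure_fraction
    happy_path = calls_per_day - failures
    graded = failures + happy_path * happy_path_rate
    return graded * cost_per_eval


if __name__ == "__main__":
    # Grade everything: 50,000 evals per day.
    full = daily_eval_cost(50_000, 0.002, happy_path_rate=1.0, failure_fraction=0.0)
    # Grade all failures (assumed 2% of traffic) plus 1% of the remainder.
    sampled = daily_eval_cost(50_000, 0.002, happy_path_rate=0.01, failure_fraction=0.02)
    print(f"full: ${full:.2f}/day, sampled: ${sampled:.2f}/day")
```

With these inputs, full evaluation costs $100/day and the sampled policy about $2.98/day. The real saving depends on your actual failure rate.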

“We keep everything for compliance.” Indiscriminate retention is expensive and risky. Privacy regulations (GDPR, UK DPA) may limit how long you can store full prompts containing personal data. Redaction at ingest time is cheaper than redaction at export time, but it still adds compute cost per logged call.
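Redaction at ingest can be as simple as a filter applied before a record is written. The sketch below assumes regex-detectable identifiers only; the `ACCT-` ID format and the `build_log_record` helper are hypothetical, and personal names generally need a dedicated PII detector rather than a regex.

```python
import re

# Simple email pattern; good enough for a sketch, not a full RFC 5322 matcher.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical internal account ID format, e.g. ACCT-1234567.
ACCOUNT_ID = re.compile(r"\bACCT-\d{6,}\b")


def redact(text: str) -> str:
    """Replace emails and account IDs with placeholders before storage."""
    text = EMAIL.sub("[EMAIL]", text)
    return ACCOUNT_ID.sub("[ID]", text)


def build_log_record(prompt: str, output: str) -> dict:
    """Build the record that is actually persisted; raw text never reaches storage."""
    return {"prompt": redact(prompt), "output": redact(output)}


if __name__ == "__main__":
    print(build_log_record("Contact jane.doe@example.com about ACCT-1234567", "Done."))
```

Because the filter runs once per logged call, its cost scales with the sampling rate: redacting only the traces you keep is cheaper than redacting everything.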

Practical decision check

Before choosing an observability setup, estimate:

  • Prompt size distribution — what is the 50th/95th/99th percentile token count per logged call?
  • Daily call volume — how many production LLM calls do you make?
  • Sampling rate — do you need 100% of calls, or can you sample and still catch regressions?
  • Retention requirement — 7 days for debugging, 30 days for trend analysis, 90+ days for compliance?
  • Redaction overhead — do you need to strip PII from logged prompts before storage?
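The first item in the checklist, the prompt size distribution, can be measured from a sample of logged token counts. This is a minimal sketch using the nearest-rank percentile method; the token counts are made-up illustration data.

```python
import math


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up per-call token counts; replace with counts from your own logs.
    token_counts = [1_200, 2_500, 3_100, 4_800, 5_000,
                    7_500, 9_000, 12_000, 30_000, 48_000]
    for pct in (50, 95, 99):
        print(f"p{pct}: {percentile(token_counts, pct)} tokens")
```

The tail matters more than the median here: a small share of very long prompts can account for most of the stored bytes.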

A rough budget: observability (storage + evaluation compute) should be 5–15% of inference spend for a healthy system. Above 20%, you are either over-retaining or over-evaluating.
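The budget rule above can be turned into a simple check. In this sketch the 5–15% guide and the 20% ceiling come from the text; the labels for the 15–20% band and for spend below 5% are assumptions added for illustration, as are the dollar figures.

```python
def observability_share(storage_cost: float, eval_cost: float,
                        inference_cost: float) -> float:
    """Observability spend (storage + evaluation) as a fraction of inference spend."""
    if inference_cost <= 0:
        raise ValueError("inference_cost must be positive")
    return (storage_cost + eval_cost) / inference_cost


def classify(share: float) -> str:
    """Map a spend share onto the rough budget bands."""
    if share > 0.20:
        return "over budget"      # likely over-retaining or over-evaluating
    if share > 0.15:
        return "watch"            # assumption: the text does not label this band
    if share >= 0.05:
        return "within guide"
    return "below guide"          # assumption: may indicate under-instrumentation


if __name__ == "__main__":
    # Made-up monthly figures in dollars.
    share = observability_share(storage_cost=150, eval_cost=900, inference_cost=10_000)
    print(f"{share:.1%} of inference spend: {classify(share)}")
```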

Tool choices affect cost

Different observability approaches have different cost profiles:

Approach | Storage cost | Eval cost | Privacy risk
--- | --- | --- | ---
Log full traces to cloud storage | Low (S3/Blob storage is cheap) | N/A if no eval | High if unredacted
Vendor observability platform (LangSmith, Portkey, Helicone) | Medium (included in seat/payload pricing) | Varies (per-eval pricing) | Medium (depends on vendor retention)
Self-hosted OpenTelemetry + eval pipeline | Low-medium (own infra) | Compute only | Lowest (full control)
Sampled traces + targeted eval | Lowest | Lowest | Depends on sample selection

The cheapest option — log nothing — is not viable for production. The most expensive — log everything and evaluate everything — is rarely justified.

Mitigations worth trying first

  1. Sample at the trace level, not the call level — store every Nth conversation rather than dropping random individual calls. This preserves conversation context for debugging.
  2. Redact at ingest, not at export — strip email addresses, names, and IDs from prompts before they hit storage. This is cheaper than retroactive redaction.
  3. Use targeted evaluation — evaluate 100% of failure cases (safety refusals, malformed outputs) and 1–5% of happy-path cases. This catches regressions without doubling inference cost.
  4. Set retention tiers — full traces for 7 days, aggregated metrics forever, raw prompts for 30 days maximum unless compliance requires longer.
  5. Watch the eval-grading cost — LLM-as-a-judge evaluations cost real tokens. Consider deterministic checks (schema validation, keyword checks, regex) for the bulk of your monitoring.
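Mitigations 1, 3 and 5 can be combined in a small routing function. This is a sketch under stated assumptions: the function names, the `{"answer"}` schema and the 2% happy-path rate are illustrative, and hashing the conversation ID is one common way to sample deterministically, not the only one.

```python
import hashlib
import json


def keep_trace(conversation_id: str, rate: float) -> bool:
    """Trace-level sampling: every call in a conversation gets the same decision."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def passes_schema(output: str, required_keys: set[str]) -> bool:
    """Deterministic check: output is a JSON object containing the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def needs_llm_eval(conversation_id: str, output: str, refused: bool,
                   happy_path_rate: float = 0.02) -> bool:
    """Grade all failures with the judge model, and only a sample of the happy path."""
    if refused or not passes_schema(output, {"answer"}):
        return True
    return keep_trace(conversation_id, happy_path_rate)


if __name__ == "__main__":
    print(needs_llm_eval("conv-42", "not json", refused=False))
    print(needs_llm_eval("conv-42", '{"answer": "ok"}', refused=False))
```

Hashing the conversation ID rather than drawing a random number per call keeps whole conversations together, so a sampled trace always has its full context. The schema check runs on every call at negligible cost, and the judge model is reserved for failures and the sampled remainder.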

Evidence and caveats

Change log

  • 2026-05-25 — Initial audit revision. Added direct source URLs to evidence section; changed source listing from named-only references to linked citations. No material changes to claims or guidance.