LLM observability cost: logs, traces and evaluation storage

Short answer: Observability for LLM apps typically costs 5–15% of your inference budget across trace storage, processing and evaluation compute, and that cost grows with prompt size, conversation depth and retention period. Sampling and redaction are not optional; they are budget controls.

What it means

Every LLM call you monitor generates a trace record: the full input prompt (often 2K–50K tokens), the model output, timestamps, model version, latency, token counts, and any tool calls or validation results. The main cost is not the observability vendor subscription; it is the storage, processing and evaluation compute on the data you keep.

A single logged call in a long conversation often generates more observability data than the response it returns:

Input trace: system prompt + conversation history + retrieved chunks = 10K–50K tokens stored per call
Output trace: generated response + tool-call arguments = 1K–10K tokens
Metadata: timestamps, model IDs, latency splits, token counts, error flags
Evaluations: running a second model to grade the output adds its own token cost
Retention: 30-day vs 90-day vs perpetual storage multiplies everything

If you log every request at full prompt depth across 10,000 calls/day, you write roughly 10–50 GB of prompt data per month (assuming about 4 bytes of text per token). With 90-day retention you hold around three months of that at any time, plus evaluation results and metadata.
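
The arithmetic behind that estimate can be sketched as follows. This is a rough illustration, not a measurement: the figure of 4 bytes per token is an assumed average for uncompressed text, and the function name `prompt_storage_gb` is hypothetical.

```python
# Rough storage estimate for logged prompts.
# ASSUMPTION: ~4 bytes of uncompressed text per token (varies by tokenizer and language).
BYTES_PER_TOKEN = 4


def prompt_storage_gb(tokens_per_call: int, calls_per_day: int, days: int) -> float:
    """Return gigabytes of raw prompt text written over `days` days."""
    total_bytes = tokens_per_call * BYTES_PER_TOKEN * calls_per_day * days
    return total_bytes / 1e9


# Range from the text: 10K-50K tokens per call at 10,000 calls/day.
monthly_low = prompt_storage_gb(10_000, 10_000, 30)
monthly_high = prompt_storage_gb(50_000, 10_000, 30)
# With 90-day retention, the steady-state footprint covers 90 days of writes.
retained_low = prompt_storage_gb(10_000, 10_000, 90)
retained_high = prompt_storage_gb(50_000, 10_000, 90)

print(f"New data per month: {monthly_low:.0f}-{monthly_high:.0f} GB")
print(f"Held at 90-day retention: {retained_low:.0f}-{retained_high:.0f} GB")
```

Under these assumptions the monthly figure comes out at 12 to 60 GB, and the amount held at 90-day retention at 36 to 180 GB. Compression, metadata and evaluation records will move the real number in either direction.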

Where teams misuse it

“We use an observability tool, so we’re covered.” Tools capture what you ask them to capture. By default, many log the full prompt and output for every request — exactly what you don’t want for PII-heavy workloads, and exactly the most expensive option for storage.

“We evaluate every output.” Using LLM-as-a-judge on every single response can roughly double your inference cost, because each graded response needs a second model call. If each eval costs $0.002 and you run 50,000/day, that’s $100/day just to grade your outputs. Sampling (e.g., 1% of traffic, plus all failures and all safety refusals) is often more informative and an order of magnitude cheaper.
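
To see how much sampling changes that bill, here is a minimal sketch using the figures above (50,000 calls/day at $0.002 per eval). The 2% failure share and the function name `daily_eval_cost` are assumptions made for illustration.

```python
def daily_eval_cost(calls_per_day: int, cost_per_eval: float,
                    happy_path_rate: float, failure_share: float) -> float:
    """Daily judge cost when all failures and a sample of other calls are graded.

    failure_share: fraction of traffic that is a failure or refusal (always graded).
    happy_path_rate: fraction of the remaining traffic sampled for grading.
    """
    failures = calls_per_day * failure_share
    sampled = (calls_per_day - failures) * happy_path_rate
    return (failures + sampled) * cost_per_eval


# Grade everything: every call is sampled, so the failure share does not matter.
full = daily_eval_cost(50_000, 0.002, happy_path_rate=1.0, failure_share=0.0)
# Grade all failures (ASSUMED 2% of traffic) plus 1% of the rest.
targeted = daily_eval_cost(50_000, 0.002, happy_path_rate=0.01, failure_share=0.02)

print(f"Evaluate everything: ${full:.2f}/day")
print(f"Targeted sampling:   ${targeted:.2f}/day")
```

With these inputs the targeted approach costs about $3/day against $100/day for grading everything. The saving depends heavily on the real failure share, so measure it before relying on the number.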

“We keep everything for compliance.” Indiscriminate retention is expensive and risky. Privacy regulations (GDPR, UK DPA) may limit how long you can store full prompts containing personal data. Redaction at ingest time is cheaper than redaction at export time, but it still adds compute cost per logged call.
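
A minimal sketch of redaction at ingest, assuming regex-based detection: the patterns, the placeholder strings and the `log_call` hook are all hypothetical, and regexes of this kind will miss names and other context-dependent PII.

```python
import re

# ASSUMPTION: illustrative patterns only. They catch common email and phone
# formats, can over-match other long digit runs, and do not detect names.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace matched spans with fixed placeholders before the text is stored."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def log_call(prompt: str, output: str, sink: list) -> None:
    """Hypothetical ingest hook: redact first, then append to the log sink."""
    sink.append({"prompt": redact(prompt), "output": redact(output)})


records: list = []
log_call("Contact jane.doe@example.com or +44 20 7946 0958 today", "Noted.", records)
print(records[0]["prompt"])
```

Because redaction runs before anything is written, the raw values never reach storage. The per-call cost is the regex pass itself; a model-based PII detector would catch more but cost more per logged call.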

Practical decision check

Before choosing an observability setup, estimate:

Prompt size distribution — what is the 50th/95th/99th percentile token count per logged call?
Daily call volume — how many production LLM calls do you make?
Sampling rate — do you need 100% of calls, or can you sample and still catch regressions?
Retention requirement — 7 days for debugging, 30 days for trend analysis, 90+ days for compliance?
Redaction overhead — do you need to strip PII from logged prompts before storage?
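
The prompt size percentiles in the first item can be computed from a sample of logged token counts. The sketch below uses a nearest-rank percentile; the sample data and the function name `percentile` are hypothetical.

```python
import math


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# HYPOTHETICAL token counts per logged call; replace with counts from your own logs.
token_counts = [1_200, 2_500, 3_100, 4_800, 6_000, 9_500, 12_000, 18_000, 31_000, 47_000]

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(token_counts, pct)} tokens")
```

The gap between the median and the tail matters for budgeting: a small share of very long prompts can dominate storage even when the typical call is short.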

A rough budget: observability (storage + evaluation compute) should be 5–15% of inference spend for a healthy system. Above 20%, you are either over-retaining or over-evaluating.
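
That budget check can be sketched as a simple ratio. The function names and the monthly figures in the example are hypothetical; only the 15% and 20% thresholds come from the guideline above.

```python
def observability_share(storage_cost: float, eval_cost: float, inference_cost: float) -> float:
    """Observability spend as a fraction of inference spend (same period, same currency)."""
    if inference_cost <= 0:
        raise ValueError("inference_cost must be positive")
    return (storage_cost + eval_cost) / inference_cost


def budget_status(share: float) -> str:
    """Classify a share against the 15% guideline ceiling and the 20% warning line."""
    if share > 0.20:
        return "over budget: review retention and evaluation volume"
    if share > 0.15:
        return "above guideline: watch the trend"
    return "within guideline"


# HYPOTHETICAL monthly figures in dollars.
share = observability_share(storage_cost=150.0, eval_cost=900.0, inference_cost=10_000.0)
print(f"{share:.1%} of inference spend: {budget_status(share)}")
```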

Tool choices affect cost

Different observability approaches have different cost profiles:

Approach	Storage cost	Eval cost	Privacy risk
Log full traces to cloud storage	Low (S3/Blob storage is cheap)	N/A if no eval	High if unredacted
Vendor observability platform (LangSmith, Portkey, Helicone)	Medium (included in seat/payload pricing)	Varies (per-eval pricing)	Medium — depends on vendor retention
Self-hosted OpenTelemetry + eval pipeline	Low-medium (own infra)	Compute only	Lowest — full control
Sampled traces + targeted eval	Lowest	Lowest	Depends on sample selection

The cheapest option — log nothing — is not viable for production. The most expensive — log everything and evaluate everything — is rarely justified.

Mitigations worth trying first

Sample at the trace level, not the call level — store every Nth conversation rather than dropping random individual calls. This preserves conversation context for debugging.
Redact at ingest, not at export — strip email addresses, names, and IDs from prompts before they hit storage. This is cheaper than retroactive redaction.
Use targeted evaluation — evaluate 100% of failure cases (safety refusals, malformed outputs) and 1–5% of happy-path cases. This catches regressions without doubling inference cost.
Set retention tiers — full traces for 7 days, aggregated metrics forever, raw prompts for 30 days maximum unless compliance requires longer.
Watch the eval-grading cost — LLM-as-a-judge evaluations cost real tokens. Consider deterministic checks (schema validation, keyword checks, regex) for the bulk of your monitoring.
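
Three of these mitigations (trace-level sampling, deterministic checks and targeted evaluation) can be sketched together. This is one possible approach, not a prescribed one: it swaps "every Nth conversation" for hashing the conversation ID, which keeps a similar fraction without a shared counter. All function names and the JSON-with-required-keys output format are assumptions.

```python
import hashlib
import json


def keep_trace(conversation_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of conversations.

    Hashing the conversation ID means every call in a kept conversation is kept,
    so sampled traces retain their full context.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def deterministic_check(output: str, required_keys: set[str]) -> bool:
    """Cheap check: output parses as a JSON object containing the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def should_evaluate(call_id: str, output: str, required_keys: set[str],
                    happy_path_rate: float = 0.01) -> bool:
    """Send every failed check to the judge, plus a hash-sampled share of passes."""
    if not deterministic_check(output, required_keys):
        return True
    return keep_trace(call_id, happy_path_rate)


print(should_evaluate("call-42", "not valid json", {"answer"}))
```

The deterministic check runs on every call at negligible cost, and the judge model only sees failures and a small sample of passes. Safety refusals would need their own detection rule, which is not shown here.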

Methodology

Data checked: 2026-05-28
Sources consulted: LangSmith observability platform, Portkey AI gateway and observability, Helicone observability for LLMs, LangFuse open-source LLM observability, OpenTelemetry LLM semantic conventions, AWS CloudWatch pricing, GCP Cloud Logging pricing, Azure Monitor pricing, ICO guidance on AI data retention
Assumptions: The 5–15% of inference spend guideline assumes moderate sampling. Compliance-heavy industries may need higher rates. Observability pricing changes frequently — check current vendor plans.
Limitations: Redaction quality depends on your PII detection approach; regex-based redaction misses context-dependent PII. The cost estimates are illustrative and depend on specific workload characteristics.
Jurisdiction: Global. UK GDPR and ICO guidance referenced where applicable.

Source list

LangSmith observability platform — https://smith.langchain.com/ (accessed 2026-05-28)
Portkey AI gateway and observability — https://portkey.ai/ (accessed 2026-05-28)
Helicone observability for LLMs — https://www.helicone.ai/ (accessed 2026-05-28)
LangFuse open-source LLM observability — https://langfuse.com/ (accessed 2026-05-28)
OpenTelemetry LLM semantic conventions — https://opentelemetry.io/docs/specs/semconv/llm/ (accessed 2026-05-28)
ICO guidance on AI data retention — https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/ (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28

Change log

2026-05-28: Added 3 Editor’s Note cards, Methodology section, Trust Stack, Source list with access dates, slugified heading IDs. Content unchanged.
2026-05-25: Initial audit revision. Added direct source URLs to evidence section; changed source listing from named-only references to linked citations. No material changes to claims or guidance.