LLM observability cost: logs, traces and evaluation storage
Short answer: Observability for LLM apps typically costs about 5–15% of your inference budget in storage, traces and evaluation compute, and that cost grows with prompt size, conversation depth, and retention period. Sampling and redaction are not optional; they are budget controls.
What it means
Every LLM call you monitor generates a record: the full input prompt (often 2K–50K tokens), the model output, timestamps, model version, latency, token counts, and any tool calls or validation results. The main cost is usually not the observability vendor subscription; it is the storage, processing and evaluation compute on the data you keep.
A single call in a long conversation can generate far more observability data than the response it produces:
- Input trace: system prompt + conversation history + retrieved chunks = 10K–50K tokens stored per call
- Output trace: generated response + tool-call arguments = 1K–10K tokens
- Metadata: timestamps, model IDs, latency splits, token counts, error flags
- Evaluations: running a second model to grade the output adds its own token cost
- Retention: 30-day vs 90-day vs perpetual storage multiplies everything
If you log every request at full prompt depth across 10,000 calls/day, you ingest roughly 10–50 GB of prompt data per month (assuming about 4 bytes per token). With 90-day retention, the store holds about three months of that at steady state, plus evaluation results and metadata.
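The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a sizing tool: `BYTES_PER_TOKEN` and the function name `prompt_storage_gb` are assumptions made for illustration, and the result ignores compression, indexing overhead, outputs and metadata.

```python
# Assumption: roughly 4 bytes of UTF-8 text per token for English prompts.
BYTES_PER_TOKEN = 4


def prompt_storage_gb(calls_per_day: int, tokens_per_call: int,
                      retention_days: int, sample_rate: float = 1.0) -> dict:
    """Estimate raw prompt storage in GB, before compression or indexing."""
    daily_bytes = calls_per_day * sample_rate * tokens_per_call * BYTES_PER_TOKEN
    gb = 1e9
    return {
        # Data written in a 30-day month.
        "ingested_per_month_gb": daily_bytes * 30 / gb,
        # Data held once the retention window is full.
        "held_at_steady_state_gb": daily_bytes * retention_days / gb,
    }


low = prompt_storage_gb(10_000, 10_000, retention_days=90)
high = prompt_storage_gb(10_000, 50_000, retention_days=90)
sampled = prompt_storage_gb(10_000, 50_000, retention_days=90, sample_rate=0.1)

for label, estimate in (("low", low), ("high", high), ("10% sample", sampled)):
    print(label, estimate)
```

Under these assumptions the low and high cases ingest about 12 GB and 60 GB per month. The `sample_rate` parameter shows how directly sampling scales the total.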
Where teams misuse it
“We use an observability tool, so we’re covered.” Tools capture what you ask them to capture. By default, many log the full prompt and output for every request — exactly what you don’t want for PII-heavy workloads, and exactly the most expensive option for storage.
“We evaluate every output.” Running LLM-as-a-judge on every response can roughly double your inference cost, because each production call triggers a second model call to grade it. If each eval costs $0.002 and you run 50,000/day, that’s $100/day just to grade your outputs. Sampling (e.g., 1% of traffic, plus all failures and all safety-refused queries) is often more informative and an order of magnitude cheaper.
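To make the eval-cost comparison concrete, here is a small sketch of the arithmetic. The 2% failure rate is a hypothetical figure chosen for illustration; the $0.002 per eval and 50,000 calls/day come from the example above.

```python
def daily_eval_cost(calls_per_day: int, cost_per_eval: float,
                    failure_rate: float, happy_path_sample: float) -> float:
    """Cost of grading every failure plus a sample of successful calls."""
    failures = calls_per_day * failure_rate
    sampled_ok = calls_per_day * (1 - failure_rate) * happy_path_sample
    return (failures + sampled_ok) * cost_per_eval


# Grade everything: all failures plus 100% of the happy path.
full = daily_eval_cost(50_000, 0.002, failure_rate=0.02, happy_path_sample=1.0)

# Grade all failures (hypothetical 2% of traffic) plus 1% of the happy path.
targeted = daily_eval_cost(50_000, 0.002, failure_rate=0.02, happy_path_sample=0.01)

print(f"full: ${full:.2f}/day, targeted: ${targeted:.2f}/day")
```

With these inputs, full grading comes to about $100/day and targeted grading to about $2.98/day. The saving depends heavily on the real failure rate, since every failure is still graded.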
“We keep everything for compliance.” Indiscriminate retention is expensive and risky. Privacy regulations (GDPR, the UK Data Protection Act 2018) may limit how long you can store full prompts containing personal data. Redaction at ingest time is cheaper than redaction at export time, but it still adds compute cost per logged call.
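A minimal sketch of redaction at ingest might look like the following. The patterns, the `CUST-` ID format, and the function names are all hypothetical. Regex catches only well-structured identifiers, so names and context-dependent PII would need a dedicated detector.

```python
import re

# Illustrative patterns only; real PII detection needs more than regex.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical internal ID format, e.g. "CUST-004217".
CUSTOMER_ID = re.compile(r"\bCUST-\d{6}\b")


def redact(text: str) -> str:
    """Replace matched identifiers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = CUSTOMER_ID.sub("[CUSTOMER_ID]", text)
    return text


def build_log_record(prompt: str, output: str, model: str) -> dict:
    """Redact before the record is handed to any storage backend."""
    return {"model": model, "prompt": redact(prompt), "output": redact(output)}


record = build_log_record(
    "Refund for jane.doe@example.com, account CUST-004217",
    "Refund issued to CUST-004217.",
    model="example-model",
)
print(record["prompt"])  # Refund for [EMAIL], account [CUSTOMER_ID]
print(record["output"])  # Refund issued to [CUSTOMER_ID].
```

Running the redaction inside the function that builds the log record means unredacted text never reaches storage, which is the point of doing it at ingest rather than at export.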
Practical decision check
Before choosing an observability setup, estimate:
- Prompt size distribution — what is the 50th/95th/99th percentile token count per logged call?
- Daily call volume — how many production LLM calls do you make?
- Sampling rate — do you need 100% of calls, or can you sample and still catch regressions?
- Retention requirement — 7 days for debugging, 30 days for trend analysis, 90+ days for compliance?
- Redaction overhead — do you need to strip PII from logged prompts before storage?
A rough budget: observability (storage + evaluation compute) should be 5–15% of inference spend for a healthy system. Above 20%, you are either over-retaining or over-evaluating.
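The checklist and budget rule above can be turned into a quick sanity check. This sketch assumes you already have per-call token counts and monthly cost figures to hand. The function names and the "watch" label for the 15–20% band are illustrative choices, not part of the guideline.

```python
import statistics
from typing import Sequence


def token_percentiles(token_counts: Sequence[int]) -> dict:
    """Return the 50th/95th/99th percentile token count per logged call."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(token_counts, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def observability_share(storage_cost: float, eval_cost: float,
                        inference_cost: float) -> float:
    """Observability spend as a fraction of inference spend."""
    return (storage_cost + eval_cost) / inference_cost


def budget_status(share: float) -> str:
    if share > 0.20:
        return "over budget: check retention and eval volume"
    if share > 0.15:
        return "watch: above the 5-15% guideline"
    return "within guideline"


# Hypothetical monthly figures in dollars.
share = observability_share(storage_cost=300, eval_cost=900, inference_cost=10_000)
print(f"{share:.0%}", budget_status(share))

# Hypothetical token counts for a handful of logged calls.
print(token_percentiles([1_200, 2_500, 3_100, 8_000, 12_000, 45_000]))
```

The percentiles matter because storage cost is driven by the tail: a small fraction of very long prompts can account for most of the bytes stored.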
Tool choices affect cost
Different observability approaches have different cost profiles:
| Approach | Storage cost | Eval cost | Privacy risk |
|---|---|---|---|
| Log full traces to cloud storage | Low (S3/Blob storage is cheap) | N/A if no eval | High if unredacted |
| Vendor observability platform (LangSmith, Portkey, Helicone) | Medium (included in seat/payload pricing) | Varies (per-eval pricing) | Medium — depends on vendor retention |
| Self-hosted OpenTelemetry + eval pipeline | Low-medium (own infra) | Compute only | Lowest — full control |
| Sampled traces + targeted eval | Lowest | Lowest | Depends on sample selection |
The cheapest option — log nothing — is not viable for production. The most expensive — log everything and evaluate everything — is rarely justified.
Mitigations worth trying first
- Sample at the trace level, not the call level — store every Nth conversation rather than dropping random individual calls. This preserves conversation context for debugging.
- Redact at ingest, not at export — strip email addresses, names, and IDs from prompts before they hit storage. This is cheaper than retroactive redaction.
- Use targeted evaluation — evaluate 100% of failure cases (safety refusals, malformed outputs) and 1–5% of happy-path cases. This catches regressions without doubling inference cost.
- Set retention tiers — full traces for 7 days, aggregated metrics forever, raw prompts for 30 days maximum unless compliance requires longer.
- Watch the eval-grading cost — LLM-as-a-judge evaluations cost real tokens. Consider deterministic checks (schema validation, keyword checks, regex) for the bulk of your monitoring.
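Several of these mitigations can be combined in a small decision layer. The sketch below hashes the conversation ID so that every call in a conversation gets the same keep-or-drop decision, and it sends an output to an LLM judge only when a cheap deterministic check fails or the conversation falls into the sample. All function names, the default 2% rate, and the JSON-with-required-keys check are assumptions for illustration.

```python
import hashlib
import json


def keep_trace(conversation_id: str, sample_rate: float) -> bool:
    """Deterministic per-conversation sampling: same ID, same decision."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


def passes_schema(output: str, required_keys: set) -> bool:
    """Cheap deterministic check: output is a JSON object with required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def needs_llm_judge(conversation_id: str, output: str, refused: bool,
                    required_keys: set, happy_path_rate: float = 0.02) -> bool:
    """Grade all failures, and only a sample of the happy path."""
    if refused or not passes_schema(output, required_keys):
        return True
    return keep_trace(conversation_id, happy_path_rate)


keys = {"answer", "sources"}
print(needs_llm_judge("conv-1", "not json", refused=False, required_keys=keys))
print(needs_llm_judge("conv-2", '{"answer": "x", "sources": []}',
                      refused=True, required_keys=keys))
```

Because both decisions use the same hash, any happy-path conversation selected for judging is also retained whenever the trace sample rate is at least as high as the judge rate, so graded outputs can still be inspected in context.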
Evidence and caveats
- Sources:
- Date checked: 2026-05-25. Observability pricing changes frequently — check current vendor plans.
- Caveats: The 5–15% of inference spend guideline assumes moderate sampling. Compliance-heavy industries may need higher rates. Redaction quality depends on your PII detection approach; regex-based redaction misses context-dependent PII.
- What would update this: Published cost benchmarks from teams running LLM observability at scale, or changes in vendor per-payload pricing models.
Change log
- 2026-05-25 — Initial audit revision. Added direct source URLs to evidence section; changed source listing from named-only references to linked citations. No material changes to claims or guidance.