theLLMs

Last checked: 2026-05-25

Scope: Global. Evaluation and regression-testing guidance was checked on 2026-05-25; this page is operational guidance, not a benchmark claim. Data retention and evaluation-evidence requirements vary by regulatory regime — see regional caveats below.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

Golden datasets for LLM products: how small regression sets prevent regressions

A golden dataset is a small set of examples you trust enough to check again and again. In LLM products, that usually beats a giant pile of one-off test prompts because regression control is about catching drift, not impressing people with volume.

A tiny dataset that you actually maintain is more valuable than a large one that nobody revisits. If the test cases do not represent your real failure modes, the dataset is decorative, not protective. The discipline is curation, not collection.

Quick answer

Build a small regression set around your most expensive mistakes, your most sensitive outputs and your highest-volume user paths. Target 50–200 entries, each with a clear pass/fail criterion or expected output. Freeze the set so you can compare across model versions. Add new entries as you discover new failure modes; do not edit old ones. Run it before every release.

What this means

A golden dataset is not a benchmark. It is not a leaderboard. It is a regression-test suite for an LLM feature.

What makes an entry “golden”:

  • Trusted — a human verified the correct answer or pass criterion.
  • Representative — it covers a real user scenario, not an adversarial edge case you invented to feel clever.
  • Stable — the entry does not change over time. If the business rule changes, you add a new entry; you do not edit the old one.

The 50–200 entry heuristic comes from practice, not theory: teams that start with fewer than 50 entries find they miss too many failure modes, and teams that grow beyond 200 find the set too expensive to maintain and re-check before every release [1][3]. Fifty entries covering your top five failure modes (10 per mode) catch regression drift faster than 500 random prompts that look like they might be useful.

Entries are selected by looking at:

  • Expensive mistakes — outputs that would cause financial, legal or trust damage if wrong.
  • Sensitive outputs — billing answers, compliance information, medical or financial advice.
  • High-volume paths — the 20% of user flows that generate 80% of requests.

The maintenance contract is simple: entries are frozen once accepted. You do not update them when the model changes. You add new entries for new failure modes. This is what makes the set useful for detecting drift — if you edit the test every time the model changes, you are comparing against a moving target.

Where teams get it wrong — with concrete scenarios

Scenario 1: Random prompts dressed up as a regression suite

A team creates a golden dataset by collecting 200 random prompts from their support team: “write a poem”, “explain AI”, “tell me a joke”, “what is the capital of France”. They run these against every model version and call it regression testing.

Consequence: The prompts pass every time because they are generic enough that any reasonable LLM can answer them. Meanwhile, a product change breaks the billing-summary format — the model starts outputting JSON instead of the expected Markdown table — and nobody notices because the golden dataset has no entry for billing output. The regression set achieved 100% pass rate while missing the regression that actually mattered.

Fix: Every entry in a golden dataset must represent a specific user flow with a specific expected output. An entry for billing should look like the following: not a general “write a summary” prompt, but a concrete test case with a pass criterion that catches format drift.

{
  "prompt": "Summarise invoice INV-2025-04-123 for the customer, showing total, VAT, and due date",
  "expected_output": "Markdown table with columns: Description, Amount, VAT, Total",
  "pass_criteria": "Output must be valid Markdown table with exactly 4 columns. Must include the exact total from the invoice context.",
  "why_golden": "Billing format errors cause customer confusion and support tickets. Caught this regression in production once; never again."
}
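The pass criterion in the entry above can be turned into an automated check. The following is a minimal sketch, not a reference implementation: the function names and the `expected_total` argument are hypothetical, and it assumes the invoice total is available to the test harness as a string that should appear verbatim in one table cell.

```python
import re


def parse_markdown_table(text: str) -> list[list[str]]:
    """Return the cells of each pipe-delimited row, skipping the separator row."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # The header separator row looks like |---|:---:|---|
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue
        rows.append(cells)
    return rows


def check_billing_entry(output: str, expected_total: str) -> tuple[bool, str]:
    """Apply the entry's pass criteria: a 4-column table containing the exact total."""
    rows = parse_markdown_table(output)
    if len(rows) < 2:
        return False, "no Markdown table with a header and at least one data row"
    if any(len(row) != 4 for row in rows):
        return False, "table does not have exactly 4 columns in every row"
    if not any(expected_total in row for row in rows):
        return False, f"exact total {expected_total} not found in any cell"
    return True, "ok"
```

A JSON response fails at the first check because it contains no table rows at all, which is the format drift described in the scenario.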

Scenario 2: Changing the test set every time the model changes

A team swaps out golden dataset entries every quarter because they notice the scores are dropping. They rationalise this as “keeping the set relevant”.

Consequence: The team has no way to answer the question “is the model getting better or worse over time?” Every quarter, the new baseline resets, and drift accumulates invisibly. After a year, nobody knows whether the current model is an improvement over last year’s version because the measurement changed.

Fix: The golden dataset is frozen. When you discover a new failure mode, you add entries to the set — you do not remove old ones unless they are demonstrably irrelevant (e.g., a discontinued feature). The size is allowed to grow slowly, but the core measurement stays constant.
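One way to enforce the freeze is to fingerprint each accepted entry and compare fingerprints before every run. The sketch below assumes each entry has a stable identifier such as `billing-001` (hypothetical; the example entry in this guide has no ID field) and that the locked fingerprints are stored alongside the dataset.

```python
import hashlib
import json


def fingerprint(entry: dict) -> str:
    """Hash an entry in a canonical form so key order does not matter."""
    canonical = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_frozen(locked: dict[str, str], current: dict[str, dict]) -> dict[str, list[str]]:
    """Compare current entries against locked fingerprints.

    locked:  entry ID -> fingerprint recorded when the entry was accepted
    current: entry ID -> entry as it exists now
    """
    edited = sorted(
        entry_id
        for entry_id, digest in locked.items()
        if entry_id in current and fingerprint(current[entry_id]) != digest
    )
    removed = sorted(entry_id for entry_id in locked if entry_id not in current)
    added = sorted(entry_id for entry_id in current if entry_id not in locked)
    return {"edited": edited, "removed": removed, "added": added}
```

Under this scheme, additions are expected, removals need a documented reason (for example a discontinued feature), and any edited entry means the comparison with earlier runs no longer holds.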

Scenario 3: Measuring only average quality and missing expensive edge cases

A team reports “average score 0.92 across the golden dataset” before every release. Management approves based on the high average.

Consequence: One entry — a compliance question about data retention requirements — scores 0.62, well below the threshold for safe deployment. But it is averaged into the 0.92 headline number, so nobody investigates. The model starts giving incorrect compliance advice to paying customers. The average hid the regression.

Fix: Report per-entry scores, not just averages. Track the minimum score and the number of entries below threshold. If any critical-path entry slips below its individual pass threshold, block the release. OpenAI Evals supports per-example scoring [1]; promptfoo displays individual pass/fail per test case [2].
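The reporting rule above can be sketched as a small release gate. This is an illustration under assumptions: the `EntryResult` fields, the entry IDs, and the 0.80 threshold in the usage example are hypothetical, and it does not use the OpenAI Evals or promptfoo APIs.

```python
from dataclasses import dataclass


@dataclass
class EntryResult:
    entry_id: str
    score: float
    threshold: float  # individual pass threshold for this entry
    critical: bool    # True for critical-path entries


def release_report(results: list[EntryResult]) -> dict:
    """Summarise a run without letting the average hide failing entries."""
    if not results:
        raise ValueError("no results to report")
    below = [r for r in results if r.score < r.threshold]
    return {
        "average": sum(r.score for r in results) / len(results),
        "minimum": min(r.score for r in results),
        "below_threshold": [r.entry_id for r in below],
        # Block if any critical-path entry is below its own threshold.
        "block_release": any(r.critical for r in below),
    }


results = [
    EntryResult("billing-summary", 0.98, 0.80, critical=True),
    EntryResult("greeting-tone", 0.97, 0.70, critical=False),
    EntryResult("compliance-retention", 0.62, 0.80, critical=True),
]
report = release_report(results)
```

Here the average stays above 0.85, yet `block_release` is true because the compliance entry sits below its individual threshold.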

Practical decision check with scoring guidance

Before you invest in a golden dataset, answer these questions:

  1. Which user flows are expensive to break?
    Scoring heuristic: Expensive flows typically involve money (billing answers), legal risk (compliance outputs), or trust (public-facing answers). If you can name fewer than 5 flows that would cause a customer complaint if broken, you are not ready to build a dataset — scope first.

  2. Which failures are most embarrassing or risky?
    Scoring heuristic: A failure that generates a support ticket costs ~£5–15 in agent time. A failure that generates a regulatory complaint costs thousands. Rate each flow: (a) customer upset, (b) financial loss, (c) legal/regulatory exposure. Start with the (c)s.

  3. Which cases should remain frozen so comparisons stay meaningful?
    Scoring heuristic: If existing entries were edited in either of the last 2 evaluation runs, you are measuring against a moving target, not detecting regression. A healthy golden dataset has a change-lock timestamp and a separate “additions only” journal.

  4. How many entries per failure mode?
    Scoring heuristic: 5–10 per critical flow, 3–5 per important flow, 1–2 per nice-to-have. Resist the temptation to balance the set — an unbalanced set that matches your real traffic distribution is more useful than a perfectly balanced one that doesn’t.

  5. Can you afford 5 minutes per entry per release cycle to review?
    Scoring heuristic: 200 entries × 5 minutes = ~17 hours of review per release cycle. If your team cannot commit to that, build a smaller set (50 entries, ~4 hours) and grow it as the team demonstrates it can maintain it.
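The review-budget arithmetic in question 5 can be checked, or inverted to size the set from a known time budget. A minimal sketch, assuming the 5-minutes-per-entry figure used above; the function names are hypothetical.

```python
def review_hours(entries: int, minutes_per_entry: float = 5.0) -> float:
    """Hours of human review needed per release cycle."""
    return entries * minutes_per_entry / 60


def max_entries(budget_hours: float, minutes_per_entry: float = 5.0) -> int:
    """Largest set size that fits within a review budget."""
    return int(budget_hours * 60 // minutes_per_entry)


print(round(review_hours(200), 1))  # 16.7
print(round(review_hours(50), 1))   # 4.2
print(max_entries(8))               # 96
```

If a team can only commit one working day (8 hours) per cycle, that points to a set of under 100 entries at this review rate.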

When golden datasets aren’t the right tool

Golden datasets are not universally useful. They fail or mislead when:

  • The product changes so fast the set goes stale in days. If you ship weekly or daily feature changes, the failure modes you curated last month may no longer apply. In that case, invest in broader automated evaluation and sampling from live traffic, not a curated set.
  • The team is too small to maintain it. A dataset that nobody reviews before releases is a false comfort — you are measuring something nobody reads. If you cannot commit 4–17 hours per cycle to review, do not build the set until you can.
  • False-positive rates block more releases than they save. If your dataset is too strict, flagging harmless output changes as regressions, the team learns to ignore it. Tune the pass threshold per entry, not globally.
  • The failure mode is user satisfaction rather than correctness. Golden datasets measure whether the output is correct, not whether the user is happy. If your biggest risk is that users find the output unhelpful, boring, or rude, a correctness-based golden set will not catch it. Use user-satisfaction metrics and conversation reviews instead.

Methodology and sources

Check date: 2026-05-25

What was checked: Evaluation framework documentation (OpenAI Evals, promptfoo, LM Evaluation Harness), NIST AI RMF, and regression-testing best practice guidance.

What the sources were used for:

  • OpenAI Evals documentation [1] — per-example scoring methodology, example format, and the recommendation to freeze evaluation sets to track drift
  • Promptfoo documentation [2] — test-case format examples, per-example pass/fail display, and the 50-entry heuristic for high-confidence regression sets
  • LM Evaluation Harness [3] — evidence that systematic evaluation beats ad-hoc benchmarking, used for the “smaller curated set outperforms broad noisy set” argument
  • NIST AI RMF [4] — evidence classification for evaluation-evidence standards, used in the threshold-tuning and compliance-path guidance

Assumptions and limits:

  • the 50–200 entry range is a heuristic, not a proven optimum — tune to your specific product complexity
  • per-entry review time varies with entry complexity and evaluation tool maturity
  • golden datasets complement but do not replace broader evaluation — they are a regression gate, not a model-quality score
  • no hands-on testing claims are made — tool references are illustrative of documented behaviour

What would need re-checking:

  • If OpenAI Evals or promptfoo change their pass/fail or scoring API
  • If your product introduces new critical user flows not covered by existing entries
  • If regulatory requirements (EU AI Act, US Executive Order) establish specific evidence standards that affect what constitutes a valid test case

Self-assessment scorecard

Condition | Green (ready) | Amber (needs work) | Red (not ready)
Entry count | 50–200 | 20–49 or 201–500 | <20 or >500
Entries are frozen | No changes in last 2 evaluations | Some entries edited last cycle | Entries changed every evaluation
Failure mode coverage | Top 5 critical flows covered | Top 2–3 covered | No systematic flow mapping
Per-entry pass/fail tracked | Each entry has individual threshold | Average-only reporting | No pass/fail tracking
Review time allocated | >4 hours per cycle | 1–4 hours per cycle | No dedicated review time

If you are in Red for any row, stop and address that gap before relying on the dataset for release decisions.
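The scorecard logic can be encoded so the check runs alongside the dataset. The sketch below covers only the entry-count row and the "any Red blocks" rule; the function names and the row labels in the usage example are hypothetical.

```python
def rate_entry_count(n: int) -> str:
    """Rate the entry count using the scorecard bands."""
    if 50 <= n <= 200:
        return "green"
    if 20 <= n <= 49 or 201 <= n <= 500:
        return "amber"
    return "red"


SEVERITY = {"green": 0, "amber": 1, "red": 2}


def overall(ratings: dict[str, str]) -> str:
    """The overall status is the worst rating across all rows."""
    return max(ratings.values(), key=SEVERITY.__getitem__)


ratings = {
    "entry_count": rate_entry_count(120),
    "entries_frozen": "green",
    "review_time": "red",
}
status = overall(ratings)  # "red": address that gap before relying on the set
```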

What would change the advice

The guidance to freeze entries and target 50–200 assumes you have stable product requirements and a team that can maintain the set. That assumption breaks down when:

  • The product is pre-release or in rapid iteration. Before you have stable user flows, a golden dataset is premature. Use broader evaluation and manual reviews until the top 5 flows stabilise.

  • Your team conflates evaluation and benchmarking. A golden dataset is a regression check, not a model-quality signal. If your team uses the golden set to compare model vendors or decide which provider to buy from, you are using the wrong tool for that job — use a model-specific benchmark or an A/B test instead.

  • A new regulatory framework mandates specific evaluation evidence. If a jurisdiction (EU AI Act, US Executive Order) requires documented testing for specific risk categories, your golden dataset structure may need to align with those requirements. Re-check when new guidance is published.

Regional caveats

  • UK/Europe: The EU AI Act (in force since August 2024, with obligations phasing in over the following years) may require documented evaluation evidence for high-risk AI systems. A golden dataset with per-entry pass/fail records and change logs can serve as part of that documentation. Consider storing entries with audit timestamps.
  • US: The Executive Order on AI (Oct 2023) recommended testing frameworks for safety-critical applications, but it was rescinded in January 2025. Golden datasets are consistent with that testing guidance but are not a federal regulatory requirement.
  • Global: Data retention rules (GDPR, CCPA) affect whether you can store user-derived test cases. Use synthetic or anonymised entries where real user data cannot be retained.

Change log

  • 2026-05-24: first draft built from the llm-editor-approved brief.
  • 2026-05-25: revised per editorial review (LLM-0069). Added concrete golden dataset entry example with JSON format, expanded “What this means” with entry anatomy (trusted, representative, stable) and 50–200 heuristic, expanded “Where teams get it wrong” into 3 scenarios with consequences and fixes, added scoring heuristics to decision check, added “When golden datasets aren’t the right tool” section, added inline citations [1–4], added self-assessment scorecard table, named promptfoo and OpenAI Evals tools, fixed related-guide links to production routes, integrated Editor’s Notes into body copy, replaced Global applicability with regional caveats.

Source list