Synthetic eval datasets: useful shortcut or false confidence?

Generating test cases with an LLM is much faster than writing them by hand. A few hours of prompting can produce hundreds of eval examples covering a range of scenarios. The question is whether those examples are any good.

Synthetic eval datasets are a useful shortcut when you treat them as a starting point, not as proof. The danger is that a dataset generated by the same class of model you are evaluating can share that model's blind spots, encode the same biases and give you confident but hollow pass rates.

TL;DR

Use synthetic data to seed your eval set, not to complete it. Generate examples, then review them manually, remove low-quality cases, add real-world examples and compare pass rates on the synthetic and human-written subsets. A system that passes only synthetic tests has not shown product readiness.

What this means

Why synthetic generation is attractive: Hand-writing eval examples is slow and expensive. A product team changing prompts every week cannot hand-craft 200 test cases per iteration. Synthetic generation — using an LLM to produce input-output pairs, expected behaviours or edge cases — can build a seed set in minutes. Tools like Promptfoo, DeepEval and OpenAI Evals all support synthetic generation as a starting step.

What can go wrong:

Distributional overlap: The model used to generate test cases produces examples that reflect its own training distribution, which is often similar to that of the model being evaluated. The eval set becomes a closed loop where the model “passes” because both generation and evaluation draw from the same pool of patterns. Real user inputs can look very different.
Missing hard cases: Synthetic generation tends to produce average or typical examples because that is what language models are best at. Edge cases such as unusual phrasing, contradictory instructions and domain-specific jargon are underrepresented, so the eval set misses the inputs most likely to cause failures in production.
Bleached language: LLM-generated test cases have a recognisable style: well-formed sentences, consistent grammar, predictable structure. Real user inputs are messy, fragmented and unpredictable. An eval set that contains only clean examples is likely to overestimate real-world performance.
Seed contamination: If the generation prompt contains examples that look like the expected answers, the generated test cases may inadvertently include answer patterns that make the eval easier to pass than a human-written test set would be.
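
One rough way to see the "bleached language" gap is to compare surface statistics of synthetic and real inputs. The sketch below is a heuristic only, not a validated metric: the function name surface_stats, the three statistics and the sample inputs are all assumptions made for illustration.

```python
import re
from statistics import mean


def surface_stats(inputs):
    """Return crude style statistics for a list of input strings."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    return {
        # Average number of whitespace-separated words per input.
        "mean_words": mean(len(text.split()) for text in inputs),
        # Share of inputs that start with a capital letter.
        "capitalised": mean(1.0 if text[:1].isupper() else 0.0 for text in inputs),
        # Share of inputs that end with '.', '?' or '!'.
        "end_punct": mean(
            1.0 if re.search(r"[.?!]\s*$", text) else 0.0 for text in inputs
        ),
    }


# Illustrative samples only; replace with your own synthetic and real inputs.
synthetic = [
    "Could you please explain how to reset my password?",
    "I would like to cancel my subscription.",
]
real = [
    "pw reset not working??",
    "cancel sub",
    "hi, charged twice help",
]

print("synthetic:", surface_stats(synthetic))
print("real:     ", surface_stats(real))
```

Large differences between the two rows suggest the synthetic set is cleaner than what users actually send. Small differences do not prove the set is representative; they only mean this particular check found nothing.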

What to do about it:

Generate synthetic examples as a seed set — aim for 3× the target eval size to allow for pruning.
Review every example manually. Delete or rewrite cases that are too similar to each other, too easy, or obviously unrealistic.
Add at least 20% real examples drawn from actual user data, support logs or known failure cases.
Track separate pass rates for synthetic vs real subsets. If the synthetic pass rate is significantly higher, the set probably has a distribution problem.
Refresh the synthetic portion periodically: user behaviour, the product and the underlying models all change, and stale synthetic data becomes less representative.
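
Some of the steps above can be supported with a few lines of code. The following is a minimal sketch, assuming each eval result is a dict with a "source" field ("synthetic" or "real") and a boolean "passed" field. The function names, the 0.9 similarity threshold and the 0.10 gap threshold are illustrative assumptions, not values from any tool or study. The near-duplicate pass helps with pruning but does not replace manual review.

```python
from difflib import SequenceMatcher

GAP_WARNING = 0.10  # assumed threshold for a "significantly higher" synthetic pass rate
MIN_REAL_SHARE = 0.20  # the 20% real-example floor suggested above


def prune_near_duplicates(inputs, threshold=0.9):
    """Keep an input only if it is not too similar to one already kept."""
    kept = []
    for text in inputs:
        if all(
            SequenceMatcher(None, text.lower(), k.lower()).ratio() < threshold
            for k in kept
        ):
            kept.append(text)
    return kept


def pass_rates_by_source(results):
    """Return the pass rate for each value of the 'source' field."""
    totals, passes = {}, {}
    for r in results:
        totals[r["source"]] = totals.get(r["source"], 0) + 1
        passes[r["source"]] = passes.get(r["source"], 0) + int(r["passed"])
    return {source: passes[source] / totals[source] for source in totals}


def real_share(results):
    """Return the fraction of results that come from real examples."""
    return sum(r["source"] == "real" for r in results) / len(results)


# Illustrative results: 8 synthetic cases (7 pass) and 2 real cases (1 passes).
results = [{"source": "synthetic", "passed": i < 7} for i in range(8)]
results += [{"source": "real", "passed": i < 1} for i in range(2)]

rates = pass_rates_by_source(results)
print("pass rates:", rates)
if real_share(results) < MIN_REAL_SHARE:
    print("warning: fewer than 20% real examples")
if rates["synthetic"] - rates["real"] > GAP_WARNING:
    print("warning: synthetic pass rate is well above the real pass rate")
```

The pairwise comparison in prune_near_duplicates is quadratic in the number of inputs, which is acceptable for a few hundred cases but slow for much larger sets.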

Where teams misuse it

Publishing eval scores based entirely on synthetic data. A team generates 500 synthetic test cases, runs evaluations, and reports “95% pass rate” without disclosing that the dataset is entirely LLM-generated. The number is misleading because the dataset is not representative of real user behaviour, and the generation model’s blind spots are inherited by the eval set.

Using the same model to generate and evaluate. A team uses GPT-4 to generate test cases and then uses GPT-4 as a judge to score the results. The eval becomes a self-consistency check: it measures whether one GPT-4 run agrees with another GPT-4 run, not whether the output is correct for a human user. Known judge biases, such as position bias, verbosity bias and a preference for a particular style, can inflate scores.

Never refreshing the dataset. A team builds a synthetic eval set once and uses it for six months. User behaviour changes, the product adds new features, and the eval set continues testing scenarios that no longer exist while missing entirely new ones. The synthetic set becomes a comfortable but irrelevant benchmark.

Practical decision check

Before using a synthetic eval dataset to make product decisions:

What proportion of the dataset was human-reviewed? If less than 50%, assume significant quality risk.
Does the dataset include real user examples, or is it entirely generated?
Are pass rates reported separately for synthetic and real subsets?
Does the dataset include known failure cases from production, or only “happy path” examples?
When was the dataset last refreshed, and what changed in the product since then?
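
The checklist above can also be run as an automated gate before a report goes out. The sketch below is one possible encoding, assuming you record basic metadata about the eval set. EvalSetInfo, decision_check and the 90-day staleness limit are hypothetical; the only threshold taken from this guide is the 50% human-review line.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EvalSetInfo:
    total: int                        # number of examples in the eval set
    human_reviewed: int               # examples a person has reviewed
    real_examples: int                # examples taken from real user data
    production_failures: int          # known failure cases from production
    reports_subsets_separately: bool  # synthetic vs real pass rates split out
    last_refreshed: date


def decision_check(info, today, max_age_days=90):
    """Return a list of warnings; an empty list means no check failed."""
    if info.total <= 0:
        raise ValueError("eval set must contain at least one example")
    warnings = []
    if info.human_reviewed / info.total < 0.5:
        warnings.append("less than 50% human-reviewed: assume significant quality risk")
    if info.real_examples == 0:
        warnings.append("dataset is entirely generated: no real user examples")
    if not info.reports_subsets_separately:
        warnings.append("pass rates are not split by synthetic vs real")
    if info.production_failures == 0:
        warnings.append("no known production failure cases included")
    # max_age_days is an assumed limit; pick one that matches your release cadence.
    if (today - info.last_refreshed).days > max_age_days:
        warnings.append("dataset not refreshed recently: check what changed in the product")
    return warnings


# Illustrative metadata for a mostly unreviewed, fully synthetic set.
info = EvalSetInfo(
    total=200,
    human_reviewed=60,
    real_examples=0,
    production_failures=0,
    reports_subsets_separately=False,
    last_refreshed=date(2026, 1, 1),
)
for warning in decision_check(info, today=date(2026, 5, 25)):
    print("warning:", warning)
```

An empty result means none of these checks failed, not that the dataset is good; the questions about what changed in the product still need a human answer.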

Methodology

Date checked: 2026-05-25
Sources consulted: Evaluation tool documentation (Promptfoo, OpenAI Evals, DeepEval), academic literature on behavioural testing of NLP models (Ribeiro et al. 2020), and published synthetic data generation methodologies.
Assumptions: Synthetic data quality depends heavily on the generation prompt and the model used. A poorly written generation prompt produces uniformly low-quality examples. No synthetic generation process has been shown to fully replace human-written examples for safety-critical eval sets.
Limitations: This guide covers general principles for synthetic eval dataset creation and review. It does not provide a production-ready generation pipeline or specific tool recommendations. Manual review should happen each time the dataset is extended.
Jurisdiction: Global. Synthetic data methodology is jurisdiction-agnostic; specific regulatory requirements for evaluation data quality and representativeness may apply under the EU AI Act or sector-specific rules.

Source list

Promptfoo synthetic data generation — https://www.promptfoo.dev/docs/guides/synthetic-data/ (accessed 2026-05-25)
OpenAI Evals generation patterns — https://platform.openai.com/docs/guides/evals (accessed 2026-05-25)
DeepEval synthetic data — https://docs.confident-ai.com/docs/synthetic-data (accessed 2026-05-25)
Ribeiro et al., “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList,” ACL 2020 — https://aclanthology.org/2020.acl-main.442/ (accessed 2026-05-25)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: full editorial review — added Editor’s Notes, Trust Stack, slugified heading IDs, formal Methodology and Source list sections, renamed “Related reading” to “Related guides”, fixed writtenBy label (editor)
2026-05-25: first published