Promptfoo vs lm-eval-harness: when each is useful
If you are evaluating an LLM product, you will hear about two tools: Promptfoo and lm-eval-harness. They are both evaluation frameworks. They are not interchangeable, and choosing the wrong one for your use case creates more work without better answers.
Quick answer
- Promptfoo is for product regression testing: checking that your prompts, model configuration and tool calls still work correctly after a change.
- lm-eval-harness is for benchmark reproduction: comparing models against standardised academic tasks.
Use Promptfoo when you care about whether your product’s behaviour changed. Use lm-eval-harness when you care about whether one model scores higher than another on a known benchmark. If your team only has time to learn one, learn Promptfoo first, because it covers the scenario that breaks in production most often: a prompt or configuration change that alters behaviour.
What this means
Promptfoo (https://www.promptfoo.dev/) lets you define test cases for your prompts, run them against a model or provider configuration, and assert on the output. It is designed for the workflow that a product team runs dozens of times per week: “I changed the system prompt. Did the key behaviours hold?”
Key features for product teams:
- Define test cases as YAML or JSON: prompt, expected output pattern, model config.
- Run eval sets locally or in CI with assertions (exact match, contains, LLM-as-judge, cost thresholds).
- Diff output across model versions, prompt versions or provider changes.
- Runs from the CLI, with integrations for GitHub Actions and GitLab CI.
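To make the "did the key behaviours hold?" workflow concrete, here is a minimal Python sketch of assertion-based regression testing. It is not Promptfoo's API or config format. `RegressionCase`, `check`, `run_suite` and the stub `fake_model` are hypothetical names chosen for illustration, and only two assertion types are modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RegressionCase:
    """One regression case: a prompt plus an assertion on the output."""
    name: str
    prompt: str
    assertion: str  # "equals" or "contains" (illustrative subset)
    expected: str


def check(case: RegressionCase, output: str) -> bool:
    """Apply the case's assertion to a single model output."""
    if case.assertion == "equals":
        return output.strip() == case.expected
    if case.assertion == "contains":
        return case.expected.lower() in output.lower()
    raise ValueError(f"unknown assertion type: {case.assertion}")


def run_suite(
    model: Callable[[str], str], cases: List[RegressionCase]
) -> List[Tuple[str, bool]]:
    """Run every case against the model and return (name, passed) pairs."""
    return [(case.name, check(case, model(case.prompt))) for case in cases]


def fake_model(prompt: str) -> str:
    """Stand-in for a real provider call (assumption: swap in your client)."""
    if "refund" in prompt.lower():
        return "You can request a refund within 30 days."
    return "I'm not sure."


CASES = [
    RegressionCase("refund-policy", "What is the refund window?", "contains", "30 days"),
    RegressionCase("unknown-topic", "Who won the match?", "equals", "I'm not sure."),
]

if __name__ == "__main__":
    for name, passed in run_suite(fake_model, CASES):
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

The point of the sketch is the shape of the result: one pass/fail verdict per named case, which is what a CI gate needs.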
lm-eval-harness (https://github.com/EleutherAI/lm-evaluation-harness) is from EleutherAI. It runs models against standardised academic benchmarks: MMLU, GSM8K, HumanEval, and hundreds more. It is designed for the workflow a research team runs when publishing a model or comparing baselines.
Key features for research teams:
- Standardised task definitions covering 200+ academic benchmarks.
- Reproducible evaluation with fixed metrics (accuracy, F1, perplexity, etc.).
- Supports local models through Hugging Face and vLLM, and hosted models through API providers.
- The primary output is aggregate scores per task, not per-answer quality verdicts for product decisions.
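By contrast, a benchmark run reduces many per-item results to one number per model. The sketch below shows that reduction for exact-match accuracy. It is an illustration only and does not use lm-eval-harness; the function names, model names and toy answers are hypothetical.

```python
from typing import Dict, List


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty task")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def compare_models(
    outputs: Dict[str, List[str]], references: List[str]
) -> Dict[str, float]:
    """Score every model on the same fixed reference set."""
    return {name: accuracy(preds, references) for name, preds in outputs.items()}


# Toy multiple-choice task with four items (made-up data).
REFERENCES = ["B", "D", "A", "C"]
OUTPUTS = {
    "model_a": ["B", "D", "A", "A"],
    "model_b": ["B", "C", "A", "A"],
}

if __name__ == "__main__":
    for name, score in compare_models(OUTPUTS, REFERENCES).items():
        print(f"{name}: accuracy={score:.2f}")
```

The score ranks the two models on the task, but it does not say which specific behaviour changed, which is why it is a poor fit for product regression checks.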
The two tools have almost no overlap in what they are good at. Promptfoo tells you if your product broke. lm-eval-harness tells you if one model outperforms another on a standardised test.
Where teams misuse them
Using lm-eval-harness for prompt regression testing. A team runs dozens of lm-eval-harness tasks every time a prompt changes, trying to catch regressions. The harness was designed for benchmark reproducibility, not for product-level assertion testing: its output is aggregate scores rather than pass/fail per test case, and the setup cost is high for quick iterations. The team could have used Promptfoo to define five regression tests and run them in under a minute.
Using Promptfoo to benchmark models for a research paper. A research team compares two model families using Promptfoo’s default configuration without controlling for evaluation methodology. Promptfoo is designed for product regression, not for the controlled conditions needed to produce publishable benchmark numbers. The results may be valid for internal decisions but are not reproducible enough for public comparison claims.
Running both for every change. A team sets up both tools to run on every pull request, doubling CI time and maintenance burden. One tool is typically sufficient: Promptfoo for product changes, lm-eval-harness for occasional model comparison or baseline updates.
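One way to avoid running both tools on every pull request is to pick the pipeline from the files that changed. The sketch below assumes a hypothetical repository layout (`prompts/`, `config/providers/`, `models/`) and hypothetical pipeline names; adjust both to your own project.

```python
from typing import Iterable, Set

# Hypothetical path prefixes; these are assumptions, not a convention of either tool.
REGRESSION_PREFIXES = ("prompts/", "config/providers/")
BENCHMARK_PREFIXES = ("models/",)


def pipelines_for(changed_files: Iterable[str]) -> Set[str]:
    """Return the pipelines worth running for a set of changed file paths."""
    selected: Set[str] = set()
    for path in changed_files:
        if path.startswith(REGRESSION_PREFIXES):
            selected.add("regression")  # e.g. a Promptfoo eval set
        if path.startswith(BENCHMARK_PREFIXES):
            selected.add("benchmark")   # e.g. an lm-eval-harness run
    return selected


if __name__ == "__main__":
    print(pipelines_for(["prompts/system.txt", "README.md"]))
```

A change that touches neither area selects no evaluation pipeline, so documentation-only pull requests do not pay the CI cost.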
Practical decision check
| Question | If yes, use |
|---|---|
| Are you testing whether a prompt change broke expected behaviour? | Promptfoo |
| Are you comparing two models on MMLU, GSM8K or HumanEval? | lm-eval-harness |
| Do you need to catch regressions in CI before merge? | Promptfoo |
| Do you need to report reproducible academic benchmark scores? | lm-eval-harness |
| Do you need both product regression AND model comparison regularly? | Both (separate pipelines) |
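The table above can be collapsed into two questions: do you need product regression checks, and do you need benchmark comparison? A small sketch of that mapping follows; the function and parameter names are hypothetical.

```python
def recommend(needs_regression: bool, needs_benchmark: bool) -> str:
    """Map the two underlying questions from the decision table to a tool choice."""
    if needs_regression and needs_benchmark:
        return "Both (separate pipelines)"
    if needs_regression:
        return "Promptfoo"
    if needs_benchmark:
        return "lm-eval-harness"
    return "Neither question applies"


if __name__ == "__main__":
    # A product team gating merges on prompt behaviour.
    print(recommend(needs_regression=True, needs_benchmark=False))
```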
Evidence and caveats
Sources:
- Promptfoo documentation: https://www.promptfoo.dev/docs/intro/
- Promptfoo GitHub Actions integration: https://www.promptfoo.dev/docs/integrations/github-actions/
- lm-eval-harness repository: https://github.com/EleutherAI/lm-evaluation-harness
- lm-eval-harness task library: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks
Caveats:
- Both tools are actively maintained and may add overlapping features over time. Check current docs before committing.
- Tool choice also depends on your stack: Promptfoo has stronger CI integration; lm-eval-harness has stronger academic task coverage.
- Neither tool replaces human review of real user interactions.
Last checked: 2026-05-25.
Related reading
- Eval CI for AI apps: testing prompts before every release
- lm-eval-harness explained for non-researchers
- Creating a model scorecard for your own workload
- Synthetic eval datasets: useful shortcut or false confidence?
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.