theLLMs

Last checked: 2026-05-25

Scope: Global. Tool docs and repository state checked 2026-05-25; individual tool features may change with releases.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

Promptfoo vs lm-eval-harness: when each is useful

If you are evaluating an LLM product, you will hear about two tools: Promptfoo and lm-eval-harness. They are both evaluation frameworks. They are not interchangeable, and choosing the wrong one for your use case creates more work without better answers.

Quick answer

  • Promptfoo is for product regression testing: checking that your prompts, model configuration and tool calls still work correctly after a change.
  • lm-eval-harness is for benchmark reproduction: comparing models against standardised academic tasks.

Use Promptfoo when you care about whether your product’s behaviour changed. Use lm-eval-harness when you care about whether one model scores higher than another on a known benchmark. If your team only has time to learn one, learn Promptfoo first, because it targets the failure a product team is most likely to ship: a prompt or configuration change that silently alters behaviour.

What this means

Promptfoo (https://www.promptfoo.dev/) lets you define test cases for your prompts, run them against a model or provider configuration, and assert on the output. It is designed for the workflow that a product team runs dozens of times per week: “I changed the system prompt. Did the key behaviours hold?”

Key features for product teams:

  • Define test cases as YAML or JSON: prompt, expected output pattern, model config.
  • Run eval sets locally or in CI with assertions (exact match, substring match, LLM-as-judge grading, cost thresholds).
  • Diff output across model versions, prompt versions or provider changes.
  • Runs from the CLI, with integrations for GitHub Actions and GitLab CI.
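To make the test-case format above concrete, here is a minimal sketch that builds a Promptfoo-style configuration in Python and prints it as JSON. The field names (`prompts`, `providers`, `tests`, `vars`, `assert`) and the assertion types (`icontains`, `llm-rubric`, `cost`) are taken from the Promptfoo docs as understood at the check date. The provider ID, prompt text, questions and thresholds are placeholders invented for this example, so check them against current docs before relying on them.

```python
import json


def build_config() -> dict:
    """Return a minimal Promptfoo-style config as a plain dict (illustrative)."""
    return {
        "description": "Support-bot regression checks (illustrative)",
        # {{question}} is filled in from each test's "vars".
        "prompts": ["You are a support assistant. Answer briefly: {{question}}"],
        # Placeholder provider ID; replace with the model your product calls.
        "providers": ["openai:gpt-4o-mini"],
        "tests": [
            {
                "vars": {"question": "How do I reset my password?"},
                "assert": [
                    # Case-insensitive substring check on the model output.
                    {"type": "icontains", "value": "reset"},
                ],
            },
            {
                "vars": {"question": "Can I get a refund after 90 days?"},
                "assert": [
                    # LLM-as-judge: a grader model checks the output against a rubric.
                    {"type": "llm-rubric", "value": "Does not promise a refund"},
                    # Fail the test if the call costs more than this many dollars.
                    {"type": "cost", "threshold": 0.01},
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Print the config; redirect stdout to a file to save it.
    print(json.dumps(build_config(), indent=2))
```

Saving the output as `promptfooconfig.json` and running `promptfoo eval -c promptfooconfig.json` is one plausible way to execute it, assuming the CLI is installed and provider credentials are configured. Most teams write the same structure directly in YAML.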

lm-eval-harness (https://github.com/EleutherAI/lm-evaluation-harness) is from EleutherAI. It runs models against standardised academic benchmarks: MMLU, GSM8K, HumanEval, and hundreds more. It is designed for the workflow a research team runs when publishing a model or comparing baselines.

Key features for research teams:

  • Standardised task definitions covering 200+ academic benchmarks.
  • Reproducible evaluation with fixed metrics (accuracy, F1, perplexity, etc.).
  • Supports local models through Hugging Face Transformers and vLLM, as well as hosted models through API providers.
  • The output is aggregate scores per task, not a per-answer quality judgement you can use for product decisions.
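As a rough illustration of the benchmark workflow, the sketch below assembles an `lm_eval` command line for two checkpoints with identical settings, which is the point of a controlled comparison. The flags used (`--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--batch_size`, `--output_path`) are assumed from the harness's 0.4.x command-line interface. The model names, few-shot count and output directories are example values, and `lm_eval_command` is a hypothetical helper, not part of the harness.

```python
import shlex


def lm_eval_command(pretrained: str, tasks: list[str], output_dir: str,
                    num_fewshot: int = 0, batch_size: int = 8) -> list[str]:
    """Build an lm_eval argv list for a Hugging Face checkpoint (hypothetical helper)."""
    return [
        "lm_eval",
        "--model", "hf",                          # Hugging Face Transformers backend
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),               # comma-separated task names
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--output_path", output_dir,              # where result files are written
    ]


if __name__ == "__main__":
    # Same tasks and few-shot setting for both models keeps the comparison fair.
    for model in ["EleutherAI/pythia-160m", "EleutherAI/pythia-410m"]:
        out_dir = f"results/{model.split('/')[-1]}"
        cmd = lm_eval_command(model, ["gsm8k"], out_dir, num_fewshot=5)
        print(shlex.join(cmd))
```

The script only prints the commands. Passing each list to `subprocess.run` would execute them, provided the harness and the model weights are available locally.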

The two tools have almost no overlap in what they are good at. Promptfoo tells you if your product broke. lm-eval-harness tells you if one model outperforms another on a standardised test.

Where teams misuse them

Using lm-eval-harness for prompt regression testing. A team runs dozens of lm-eval-harness tasks every time a prompt changes, trying to catch regressions. The harness was designed for benchmark reproducibility, not for product-level assertion testing: its primary output is aggregate scores rather than pass/fail per test case, and the setup cost is high for quick iterations. The team could have defined five regression tests in Promptfoo and run them in under a minute.
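The difference between a score and a per-case verdict is easy to see in a tool-agnostic sketch. Nothing below uses either tool's API: `Case`, `fake_model` and `run_cases` are hypothetical names, and the canned replies stand in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str
    must_contain: str  # substring the reply has to include (case-insensitive)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    canned = {
        "refund": "Refunds are available within 30 days of purchase.",
        "password": "Use the 'Forgot password' link on the sign-in page.",
    }
    for key, reply in canned.items():
        if key in prompt.lower():
            return reply
    return "I'm not sure."


def run_cases(cases: list[Case], model: Callable[[str], str]) -> dict[str, bool]:
    """Return a pass/fail verdict per named test case."""
    return {c.name: c.must_contain.lower() in model(c.prompt).lower() for c in cases}


CASES = [
    Case("refund-window", "What is the refund policy?", "30 days"),
    Case("password-reset", "How do I reset my password?", "forgot password"),
    Case("escalation", "I want to talk to a human.", "support@example.com"),
]

if __name__ == "__main__":
    outcomes = run_cases(CASES, fake_model)
    for name, ok in outcomes.items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    # A benchmark-style summary collapses the same results into one number.
    print(f"aggregate score: {sum(outcomes.values()) / len(outcomes):.2f}")
```

The aggregate number says that something regressed. The per-case lines say which behaviour regressed, and that is what a reviewer needs before merging a prompt change.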

Using Promptfoo to benchmark models for a research paper. A research team compares two model families using Promptfoo’s default configuration without controlling for evaluation methodology, such as few-shot count, prompt format and decoding settings. Promptfoo is designed for product regression, not for the controlled conditions needed to produce publishable benchmark numbers. The results may be valid for internal decisions but are not reproducible enough for public comparison claims.

Running both for every change. A team sets up both tools to run on every pull request, doubling CI time and maintenance burden. One tool is typically sufficient: Promptfoo for product changes, lm-eval-harness for occasional model comparison or baseline updates.
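One way to keep the two pipelines separate is to select them by which files a pull request touches. The sketch below assumes a hypothetical repository layout. The path patterns and pipeline names are invented for illustration and are not conventions of either tool.

```python
from fnmatch import fnmatchcase

# Hypothetical repo layout; adjust the patterns to your own project.
PROMPT_PATTERNS = ["prompts/*", "promptfooconfig.*", "src/llm/*"]
MODEL_PATTERNS = ["models/*", "configs/base_model.*"]


def pipelines_for(changed_files: list[str]) -> set[str]:
    """Pick which eval pipelines a change should trigger."""
    selected: set[str] = set()
    for path in changed_files:
        # Prompt or product-config changes get the fast regression suite.
        if any(fnmatchcase(path, p) for p in PROMPT_PATTERNS):
            selected.add("promptfoo-regression")
        # Base-model changes get the slower benchmark run.
        if any(fnmatchcase(path, p) for p in MODEL_PATTERNS):
            selected.add("lm-eval-benchmarks")
    return selected


if __name__ == "__main__":
    print(pipelines_for(["prompts/system.txt", "README.md"]))
    print(pipelines_for(["models/checkpoint.json"]))
```

Under this scheme a documentation-only change triggers neither pipeline, and the benchmark run stays off the critical path for ordinary prompt edits.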

Practical decision check

Question | If yes, use
Are you testing whether a prompt change broke expected behaviour? | Promptfoo
Are you comparing two models on MMLU, GSM8K or HumanEval? | lm-eval-harness
Do you need to catch regressions in CI before merge? | Promptfoo
Do you need to report reproducible academic benchmark scores? | lm-eval-harness
Do you need both product regression AND model comparison regularly? | Both (separate pipelines)

Evidence and caveats

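Sources:

  • Promptfoo documentation: https://www.promptfoo.dev/
  • lm-eval-harness repository: https://github.com/EleutherAI/lm-evaluation-harness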

Caveats:

  • Both tools are actively maintained and may add overlapping features over time. Check current docs before committing.
  • Tool choice also depends on your stack: Promptfoo has stronger CI integration; lm-eval-harness has stronger academic task coverage.
  • Neither tool replaces human review of real user interactions.

Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.