Eval CI for AI apps: testing prompts before every release
If prompts, tools or model settings can change your product, they should be tested before release. Eval CI is the boring but useful habit of making that check part of the pipeline instead of a heroic manual step.
Quick answer
Put a small, relevant eval set into CI and fail the release when important scores drift outside agreed thresholds. Manual spot checks are useful, but they are not a release system. Keep the suite cheap enough that teams will actually run it: a gate that is too expensive to run will be skipped.
What this means
The point of eval CI is not perfect scientific measurement. The point is to make quality drift visible before users find it. That means the suite has to be repeatable, fast and tied to the release cadence.
A minimal example: a GitHub Actions workflow that runs promptfoo eval against a small regression set (tens of examples, not thousands) on every pull request that touches a prompt file. If the pass rate drops below an agreed threshold — say 80% on the core task — the workflow returns a non-zero exit code and the PR cannot merge. Promptfoo’s docs cover this exact pattern, and DeepEval offers a similar integration with its own assertion framework. Neither requires a dedicated server; both run as a single CI step.
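# .github/workflows/eval-ci.yml (minimal example)
name: prompt-eval
on:
  pull_request:
    paths: ['prompts/**']
jobs:
  eval:
    runs-on: ubuntu-latest
    env:
      # Provider key name is an assumption; use whatever your config needs.
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Minimum pass rate as a percentage; the eval step exits non-zero below it.
      # Check the promptfoo CLI docs for the current variable name and default.
      PROMPTFOO_PASS_RATE_THRESHOLD: 80
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npx promptfoo eval --config promptfooconfig.yaml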
The suite should cover the product’s most common failure modes: hallucination on known facts, refusal on safe queries, instruction-following drift. OpenAI Evals and Anthropic’s evaluation guide both cover how to design these small regression sets — pick the scenarios that hurt most when they break.
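To make the shape of such a set concrete, here is a minimal sketch in Python. It is not the API of any tool named above: `EvalCase`, `run_gate` and `fake_model` are hypothetical names, the three cases are invented, and `fake_model` stands in for a real model call. Each case carries a category, a prompt and a cheap deterministic check, and the gate turns the pass rate into an exit code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    category: str                  # failure mode this case guards against
    prompt: str                    # input sent to the model
    check: Callable[[str], bool]   # True when the output is acceptable


# Hypothetical regression set: one case per failure mode named in the text.
CASES = [
    EvalCase("known_facts", "What is the capital of France?",
             lambda out: "paris" in out.lower()),
    EvalCase("safe_query_refusal", "How do I reset my password?",
             lambda out: "can't help" not in out.lower()),
    EvalCase("instruction_following",
             "Reply with exactly one word: yes or no. Is water wet?",
             lambda out: len(out.split()) == 1),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: swap in your own client)."""
    canned = {
        CASES[0].prompt: "The capital of France is Paris.",
        CASES[1].prompt: "Open Settings, then choose Reset password.",
        CASES[2].prompt: "Yes, it is.",  # breaks the one-word instruction
    }
    return canned[prompt]


def run_gate(cases, model, threshold: float) -> tuple[float, int]:
    """Return (pass_rate, exit_code); exit code 1 means the gate failed."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    pass_rate = passed / len(cases)
    return pass_rate, 0 if pass_rate >= threshold else 1
```

In a CI step, a thin wrapper would pass the returned exit code to `sys.exit` so that a failed gate blocks the merge.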
Where teams get it wrong
Running evals after release instead of before. A team ships a prompt update on Friday, runs the eval suite on Monday as a “quality check,” finds a regression that affects 5% of responses, and has no clean rollback path. By then the bad responses have already reached users, support tickets are already filed, and the fix requires another deploy cycle. The eval suite was useful only as a post-mortem: it explained what went wrong, but only after the damage was done. Eval CI means the gate fires before merge, not after deploy.
Using a huge suite that is too slow for everyday work. An engineering team builds a comprehensive eval suite with hundreds of test cases covering every known edge case. Each run takes 45 minutes. Developers start skipping it before merging quick fixes (“it’s just a wording change”), and the eval effectively becomes an occasional batch job rather than a release gate. The suite was correct in coverage but wrong in design: a fast, representative subset that runs in under 5 minutes will catch most regressions and, unlike the full suite, actually gets used. The full suite can still run nightly for deeper analysis.
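One way to carve a fast gate out of a large suite is to sample a fixed number of cases per category, so that small categories are not crowded out by large ones. The sketch below is illustrative only: `fast_subset`, the category names and the 300-case suite are assumptions, not part of any tool mentioned here.

```python
import random
from collections import defaultdict


def fast_subset(cases, per_category: int, seed: int = 0):
    """Pick up to `per_category` cases from each category, reproducibly."""
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)

    rng = random.Random(seed)  # fixed seed: every PR run sees the same cases
    subset = []
    for category in sorted(by_category):
        group = by_category[category]
        k = min(per_category, len(group))  # small categories are kept whole
        subset.extend(rng.sample(group, k))
    return subset


# Hypothetical full suite: 300 cases spread unevenly over three categories.
FULL_SUITE = (
    [{"id": f"facts-{i}", "category": "known_facts"} for i in range(200)]
    + [{"id": f"refusal-{i}", "category": "safe_query_refusal"} for i in range(90)]
    + [{"id": f"format-{i}", "category": "instruction_following"} for i in range(10)]
)

# 15 + 15 + 10 = 40 cases for the pull-request gate; the nightly job runs FULL_SUITE.
PR_SUITE = fast_subset(FULL_SUITE, per_category=15)
```

The fixed seed keeps the gate comparable from one pull request to the next. Changing it is a deliberate choice, because it changes which cases the gate sees.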
Treating one score as proof that the whole feature is safe. A team watches the overall pass rate hover at 92% across releases and calls it good enough. But the 8% failure rate is concentrated in one category — say, queries about financial advice — which means a specific safety-critical failure is recurring silently. An aggregate score hides distribution. The useful signal is per-category tracking: if the “financial advice” category drops from 95% to 80% while the headline score barely moves, that is a real regression the combined number would miss.
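The arithmetic behind that point can be sketched in a few lines. The counts are invented for illustration (200 cases, 20 of them financial advice), and `pass_rates`, `regressions` and `make_results` are hypothetical helper names rather than functions from any eval library.

```python
def pass_rates(results):
    """Map category -> pass rate, plus an 'overall' entry, from (category, passed) pairs."""
    totals, passes = {}, {}
    for category, passed in results:
        totals[category] = totals.get(category, 0) + 1
        passes[category] = passes.get(category, 0) + int(passed)
    rates = {c: passes[c] / totals[c] for c in totals}
    rates["overall"] = sum(passes.values()) / sum(totals.values())
    return rates


def regressions(baseline, current, max_drop: float):
    """Return the entries whose pass rate fell by more than `max_drop`."""
    return sorted(
        c for c in baseline
        if c in current and baseline[c] - current[c] > max_drop
    )


def make_results(counts):
    """Expand {category: (passed, total)} into (category, passed) pairs."""
    return [
        (category, i < passed)
        for category, (passed, total) in counts.items()
        for i in range(total)
    ]


# Illustrative numbers only.
baseline = pass_rates(make_results({"general": (165, 180), "financial_advice": (19, 20)}))
current = pass_rates(make_results({"general": (165, 180), "financial_advice": (16, 20)}))

# Overall moves from 184/200 = 92% to 181/200 = 90.5%, a 1.5-point drop.
# financial_advice moves from 19/20 = 95% to 16/20 = 80%, a 15-point drop.
flagged = regressions(baseline, current, max_drop=0.05)
```

With a five-point tolerance, `flagged` contains only `financial_advice`. The overall score stays inside the tolerance and would not have tripped a gate that watches the headline number alone.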
Practical decision check
- What behaviour must not regress?
- What threshold is strict enough to matter but loose enough to be useful?
- Who gets alerted when the gate fails?
- How do you distinguish a real regression from prompt sensitivity that needs threshold tuning?
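For the last question, one possible approach is to put an uncertainty band around the measured pass rate before declaring a failure. The sketch below uses the Wilson score interval, a standard statistical technique brought in here for illustration. None of the cited sources prescribe it, and it assumes each case is an independent pass/fail trial, which is only roughly true for LLM outputs. The function names are hypothetical.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def clearly_below(passed: int, total: int, threshold: float) -> bool:
    """True only when even the upper bound of the interval is under the threshold."""
    _, upper = wilson_interval(passed, total)
    return upper < threshold
```

On a 50-case suite with an 80% threshold, `clearly_below(38, 50, 0.8)` is false, because 76% on 50 cases is still statistically compatible with a true rate of 80%, while `clearly_below(30, 50, 0.8)` is true. A team might hard-fail only on the second kind of result and send the first kind to a re-run or a human look. That split is a policy choice, not something the statistics decide.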
What this page cannot tell you
This page cannot tell you the perfect CI threshold for your product. It can only help you turn “we should test this” into an automated release habit.
Global applicability
The pattern is universal: if a prompt or model change can affect user-facing quality, the release process should catch it before the change ships.
Methodology and sources
Check date: 2026-05-24
What was checked: CI, eval and release-gating documentation
What the sources were used for:
- connecting evals to release discipline rather than one-off analysis
- showing why suite size and runtime matter (Promptfoo, DeepEval, GitHub Actions docs)
- keeping thresholds and ownership explicit
- providing concrete CI workflow patterns (Promptfoo GitHub Actions integration, DeepEval assertion framework)
Assumptions and limits:
- release cadence is regular enough to automate
- the team can define a few high-value failure modes
- this is process guidance, not an absolute safety guarantee
What would change this advice:
- If non-deterministic scoring becomes reliable enough to replace human judgment on individual outputs, or if model providers introduce built-in eval gates at the API level, the case for a separate CI step weakens.
- If eval suite runtimes drop dramatically (sub-second on hundreds of examples), the trade-off between coverage and speed disappears and teams should run broader suites.
- If evaluation-as-a-service tools mature to integrate directly into GitHub/GitLab merge queues with minimal config, the manual workflow setup in this guide would be superseded by provider-managed gates.
Change log
- 2026-05-24: first draft built from the llm-editor-approved brief.
- 2026-05-24: revision — added pipeline pseudocode, expanded failure-mode examples, integrated inline notes, fixed internal links, added evidence-change paragraph.
Source list
- Promptfoo GitHub Actions integration — https://www.promptfoo.dev/docs/integrations/github-actions/
- DeepEval GitHub Actions guide — https://docs.confident-ai.com/docs/github-actions-integration
- OpenAI Evals docs — https://platform.openai.com/docs/guides/evals
- GitHub Actions docs — https://docs.github.com/actions
- Anthropic evaluation docs — https://docs.anthropic.com/en/docs/evaluate
- NIST AI RMF — https://www.nist.gov/itl/ai-risk-management-framework