---
hero_image: "/images/hero/eval-ci-for-ai-apps-testing-prompts-before-every-release.png"
layout: ../../layouts/GuideLayout.astro
title: "Eval CI for AI apps: testing prompts before every release"
description: "A guide to adding automated LLM evaluation to your CI pipeline, so prompt and model changes are tested before they reach users, not after."
writtenBy: "gemma4:26b"
reviewedBy: "deepseek-r1:32b"
lastChecked: "2026-05-28"
scope: "Global. Evaluation and CI documentation was checked on 2026-05-28; this page is operational guidance, not a warranty of quality."
---

# Eval CI for AI apps: testing prompts before every release

If prompts, tools or model settings can change your product, they should be tested before release. Eval CI is the boring but useful habit of making that check part of the pipeline instead of a heroic manual step.

## TL;DR

Put a small, relevant eval set into CI and fail the release when important scores drift outside agreed thresholds. Keep the suite cheap enough that teams will actually run it. Manual spot checks are useful, but they are not a release system, and a gate that is too expensive to run will be skipped: the suite has to be small enough to survive reality.

**Editor's Note**

The most common failure mode for eval CI is not that the tests are wrong; it is that they are too slow. If your eval suite takes more than 5 minutes, developers will skip it. A fast, representative subset of tens of examples that catches 90% of regressions is worth more than a comprehensive suite that nobody runs.

## What this means

The point of eval CI is not perfect scientific measurement. The point is to make quality drift visible before users find it. That means the suite has to be repeatable, fast and tied to the release cadence.

A minimal example: a GitHub Actions workflow that runs `promptfoo eval` against a small regression set (tens of examples, not thousands) on every pull request that touches a prompt file. If the pass rate drops below an agreed threshold (say, 80% on the core task), the workflow returns a non-zero exit code and, provided the check is marked as required, the PR cannot merge. Be cautious with synthetic datasets: they are fast to generate but can create false confidence if the synthetic distribution does not match real user inputs. Promptfoo’s docs cover this exact pattern, and DeepEval offers a similar integration with its own assertion framework. Neither requires a dedicated server; both run as a single CI step.

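```yaml
# .github/workflows/eval-ci.yml (minimal example)
name: prompt-eval
on:
  pull_request:
    paths: ['prompts/**']   # only run when a prompt file changes
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      # The eval step fails the job when the pass rate is below the threshold.
      # Assumption: PROMPTFOO_PASS_RATE_THRESHOLD is read as a percentage;
      # confirm the variable against the promptfoo docs for your version.
      # Provider API keys would also be passed in via env from repository secrets.
      - run: npx promptfoo eval --config promptfooconfig.yaml
        env:
          PROMPTFOO_PASS_RATE_THRESHOLD: "80"
```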
The suite should cover the product’s most common failure modes: hallucination on known facts, refusal on safe queries, instruction-following drift. OpenAI Evals and Anthropic’s evaluation guide both cover how to design these small regression sets; pick the scenarios that hurt most when they break.
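One way to picture such a regression set: each case pairs an input with a cheap, deterministic check and a category label. The sketch below is illustrative only. `EvalCase`, `run_suite` and the stub `fake_model` are hypothetical names, and real checks would usually be richer than substring matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    category: str                  # which failure mode this case guards
    prompt: str
    check: Callable[[str], bool]   # True when the output is acceptable


CASES = [
    EvalCase("known_facts", "What is the capital of France?",
             lambda out: "paris" in out.lower()),
    EvalCase("safe_query_refusal", "How do I boil an egg?",
             lambda out: "i can't help" not in out.lower()),
    EvalCase("instruction_following", "Reply with exactly one word: yes or no.",
             lambda out: len(out.split()) == 1),
]


def run_suite(model: Callable[[str], str], cases=CASES):
    """Run every case through the model and record pass/fail per case."""
    return [{"category": c.category, "passed": c.check(model(c.prompt))}
            for c in cases]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    canned = {
        "What is the capital of France?": "Paris.",
        "How do I boil an egg?": "Simmer it for about eight minutes.",
    }
    return canned.get(prompt, "yes")


results = run_suite(fake_model)
```

Keeping the category on every record means the same results can feed both an aggregate score and a per-category breakdown.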

## Where teams get it wrong

**Running evals after release instead of before.** A team ships a prompt update on Friday, runs the eval suite on Monday as a “quality check,” finds a regression that affects 5% of responses, and has no clean rollback path. By then the bad responses have already reached users, support tickets are already filed, and the fix requires another deploy cycle. The eval suite was useful only as a post-mortem: it told them what went wrong, but only after the damage had happened. Eval CI means the gate fires before merge, not after deploy.

**Using a huge suite that is too slow for everyday work.** An engineering team builds a comprehensive eval suite with hundreds of test cases covering every known edge case. Each run takes 45 minutes. Developers start skipping it before merging quick fixes (“it’s just a wording change”), and the eval effectively becomes an occasional batch job rather than a release gate. The suite was correct in coverage but wrong in design: a fast, representative subset that runs in under 5 minutes catches 90% of regressions and actually gets used. The full suite can run nightly for deeper analysis.
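A fast subset can be carved out of the full suite mechanically. The following is a minimal sketch, assuming each case carries a `category` label and that sampling evenly across categories is a reasonable proxy for "representative" (an assumption, not something the sources prescribe). The function name and the sample suite are hypothetical.

```python
import random
from collections import defaultdict


def representative_subset(cases, per_category=5, seed=0):
    """Pick up to `per_category` cases from each category, reproducibly."""
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)
    rng = random.Random(seed)  # fixed seed keeps the subset stable across runs
    subset = []
    for category in sorted(by_category):
        group = by_category[category]
        subset.extend(rng.sample(group, min(per_category, len(group))))
    return subset


# Hypothetical full suite: 300 cases spread across three categories.
full_suite = [{"id": i, "category": ["facts", "refusals", "format"][i % 3]}
              for i in range(300)]
fast_suite = representative_subset(full_suite, per_category=5)
```

The fixed seed matters for a release gate: if the subset changed on every run, a score change could come from the sample rather than from the prompt.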

**Treating one score as proof that the whole feature is safe.** A team watches the overall pass rate hover at 92% across releases and calls it good enough. But the 8% failure rate is concentrated in one category (say, queries about financial advice), which means a specific safety-critical failure is recurring silently. An aggregate score hides distribution. The useful signal is per-category tracking: if the “financial advice” category drops from 95% to 80% while the headline score barely moves, that is a real regression the combined number would miss.

**Editor's Note**

Per-category tracking is the difference between knowing you have a problem and knowing which problem you have. Set up category-level thresholds alongside your aggregate score from day one. An aggregate pass rate that stays at 92% while your safety-critical category drops to 70% is not a pass; it is a near miss the dashboard was designed to hide.
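Category-level thresholds take only a few lines to check. This sketch assumes the same hypothetical record shape as before (`category` plus a boolean `passed`). The category names, the sample numbers and the 0.95 threshold are made up for illustration.

```python
from collections import defaultdict


def category_pass_rates(results):
    """Map each category to its pass rate."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["passed"])
    return {c: passes[c] / totals[c] for c in totals}


def failing_categories(results, thresholds, default=0.8):
    """Return {category: rate} for every category below its threshold."""
    rates = category_pass_rates(results)
    return {c: rate for c, rate in rates.items()
            if rate < thresholds.get(c, default)}


# Hypothetical run: 92 of 100 cases pass, but every failure is in one category.
results = ([{"category": "general", "passed": True}] * 80
           + [{"category": "financial_advice", "passed": True}] * 12
           + [{"category": "financial_advice", "passed": False}] * 8)
thresholds = {"financial_advice": 0.95}  # stricter for the safety-critical category
failed = failing_categories(results, thresholds)
```

In this made-up run the aggregate sits at 92%, yet `failed` contains the `financial_advice` category at a 60% pass rate, which is the situation an aggregate-only gate would wave through.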

## Practical decision check

- What behaviour must not regress?
- What threshold is strict enough to matter but loose enough to be useful?
- Who gets alerted when the gate fails?
- How do you distinguish a real regression from prompt sensitivity that needs threshold tuning?

**Editor's Note**

The last question, distinguishing real regressions from threshold noise, is the one that determines whether your team trusts the CI gate or learns to ignore it. Start with permissive thresholds and tighten them as you build confidence in the signal. A gate that fails on every minor wording change trains developers to bypass it. A gate that only fails on unambiguous regressions earns their trust.
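One rough way to put a number on "threshold noise" is the sampling error of a pass rate measured on a small suite. The sketch below uses a binomial standard error as a heuristic. It is not a method taken from the sources, it treats cases as independent, and it ignores run-to-run model non-determinism. The function names and the multiplier `z=2.0` are assumptions for illustration.

```python
import math


def noise_margin(pass_rate, n_cases, z=2.0):
    """Approximate sampling noise of a pass rate: z times the binomial standard error."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_cases)


def looks_like_regression(baseline_rate, new_rate, n_cases, z=2.0):
    """True when the drop from baseline exceeds the approximate noise margin."""
    return (baseline_rate - new_rate) > noise_margin(baseline_rate, n_cases, z)


margin = noise_margin(0.90, 50)                        # roughly 0.085 on 50 cases
small_drop = looks_like_regression(0.90, 0.86, 50)     # False: within the margin
large_drop = looks_like_regression(0.90, 0.78, 50)     # True: exceeds the margin
```

With only 50 cases, a four-point drop is within the margin, which is one reason permissive starting thresholds make sense for small suites. The heuristic breaks down at a baseline of 100%, where the margin collapses to zero and any single failure is flagged.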

## Caveats and scope boundaries

- This page provides process guidance for adding eval gates to CI pipelines. It is not a tutorial on specific eval frameworks (Promptfoo, DeepEval, OpenAI Evals) or a guarantee of safety.
- The pattern is universal: if a prompt or model change can affect user-facing quality, the release process should catch it before the change ships.
- If non-deterministic scoring becomes reliable enough to replace human judgment on individual outputs, or if model providers introduce built-in eval gates at the API level, the trade-offs described here may shift. As of May 2026, the CI-based approach described remains the practical default.

## Methodology

- Data checked: 2026-05-28
- Sources consulted: Promptfoo GitHub Actions integration docs, DeepEval CI guide, OpenAI Evals documentation, GitHub Actions documentation, Anthropic evaluation docs, NIST AI RMF
- Assumptions: The reader’s team has a regular release cadence, can define a few high-value failure modes, and uses GitHub Actions or a comparable CI system
- Limitations: This article provides process guidance, not an absolute safety guarantee or a benchmark methodology. Eval CI is a quality practice, not a replacement for human review on safety-critical features
- Jurisdiction: Global. NIST AI RMF (US) referenced

## Source list

- Promptfoo GitHub Actions integration: https://www.promptfoo.dev/docs/integrations/github-actions/ (accessed 2026-05-28)
- DeepEval GitHub Actions guide: https://docs.confident-ai.com/docs/github-actions-integration (accessed 2026-05-28)
- OpenAI Evals docs: https://platform.openai.com/docs/guides/evals (accessed 2026-05-28)
- GitHub Actions docs: https://docs.github.com/actions (accessed 2026-05-28)
- Anthropic evaluation docs: https://docs.anthropic.com/en/docs/evaluate (accessed 2026-05-28)
- NIST AI RMF: https://www.nist.gov/itl/ai-risk-management-framework (accessed 2026-05-28)

## Related guides

- Golden datasets for LLM products: how small regression sets prevent regressions
- Eval gaming: when models optimise for the test rather than the task
- Human-in-the-loop AI: approval queues that do not become bottlenecks

## Trust Stack

- Last checked: 2026-05-28
- Corrections: Contact us to report errors

## Change log

- 2026-05-28: Full editorial review against 16-gate checklist. Added 3 Editor’s Note asides. Added Methodology, Source list with access dates, Trust Stack, slugified heading IDs, and Caveats section. Fixed frontmatter writtenBy label. Folded Global applicability into Caveats. Corrected related guide paths to relative format.
- 2026-05-24: First draft built from editorial brief. Revision added pipeline pseudocode, expanded failure-mode examples, and evidence-change paragraph.