LLM Testing Lifecycle — Pre-Prod to Production¶
A complete view of the LLM/AI testing lifecycle: where each activity fits, what's tested at each stage, what tooling supports it, and what evidence comes out. Designed as a process companion to the metric-heavy
llm-agent-evaluation-matrix.md.
1. The Lifecycle at a Glance¶
flowchart LR
S1[📐 Requirements<br/>& Eval Spec] --> S2[🛠️ Develop<br/>& Unit Eval]
S2 --> S3[🔬 Integration<br/>System Eval]
S3 --> S4[🛡️ Pre-Release<br/>Hardening]
S4 --> S5[🚀 Release<br/>Canary / Shadow / A/B]
S5 --> S6[📊 Production<br/>Monitoring]
S6 -.->|Incident → Test| S2
S6 -.->|Drift detected| S3
style S1 fill:#e3f2fd,stroke:#1976d2,color:#000
style S2 fill:#e8f5e9,stroke:#2e7d32,color:#000
style S3 fill:#fff3e0,stroke:#f57c00,color:#000
style S4 fill:#fce4ec,stroke:#c2185b,color:#000
style S5 fill:#f3e5f5,stroke:#7b1fa2,color:#000
style S6 fill:#fff8e1,stroke:#fbc02d,color:#000
Each stage has its own owners, tooling, and evidence artefacts. Treating them as separate steps — not one giant "test" phase — is the discipline that separates mature AI engineering from demo-grade work.
2. Stage 1 — Requirements & Eval Specification¶
Before anyone writes a prompt. What does this feature have to do, and how will we know?
Activities¶
- Capture functional requirements (what should the AI do)
- Capture non-functional AI requirements (latency, cost, safety, refusal behaviour, citation behaviour)
- Define the eval spec: what metrics, what dataset, what thresholds, what gates
- Map requirements to OWASP LLM Top 10 + NIST AI RMF risk categories
- Identify regulatory exposure (EU AI Act, GxP, DORA, FCA, GDPR)
Deliverables¶
| Artefact | Purpose |
|---|---|
| Feature spec | What the AI does, in user terms |
| Eval spec | What we'll measure, against what, at what threshold |
| Risk map | Threat model + regulatory tags |
| Initial golden set definition | What categories of test cases we'll build (not the cases themselves yet) |
Common mistakes here¶
- Writing the prompt before the eval spec — leads to tests that just re-encode current behaviour
- Skipping non-functional requirements — discovering you need < 2s p95 latency after picking a model that runs at 5s
- Treating "good enough" as a feeling — the threshold must be a number derived from data
3. Stage 2 — Develop & Unit-Level Evaluation¶
Engineer-loop. Fast, cheap, runs on every change.
Activities¶
- Build the feature: prompt + retrieval + tools + guardrails
- Author golden-set cases covering happy path + edge cases (start small, ~20–50 cases per feature)
- Wire unit-style evals into the developer loop — fast (< 1 min ideally), cheap, runnable on demand
- Treat prompts as versioned code
Tooling¶
| Layer | Tool |
|---|---|
| Test runner | pytest + DeepEval, Promptfoo CLI |
| Metrics | DeepEval / Ragas / custom |
| Judge model | A cheap-but-strong model (e.g. gpt-4o-mini class) — calibrated against humans |
| Trace / debug | Langfuse / LangSmith / local logging |
What "unit-level" means for LLM systems¶
Not "unit testing the LLM" — that's nonsense. It means: - Per-prompt assertions — a specific prompt + input → expected behaviour property - Per-component assertions — retriever returns expected chunks, guardrail blocks expected input - Schema conformance — output validates against pydantic / JSON Schema
Anti-patterns¶
- Golden set written after the prompt (tests encode current behaviour, not requirements)
- All cases happy-path (real bugs are at edges)
- LLM-as-judge with no human calibration sample
- Tests that take 30 minutes to run (engineers stop running them)
4. Stage 3 — Integration & System Evaluation¶
The system as a whole, on realistic inputs. Run nightly or on merge to main.
Activities¶
- Run the full eval suite against a representative dataset (100s–1000s of cases)
- Compare metrics against baseline; alert on deltas
- Include cross-component scenarios (RAG + agent + tools + guardrails together)
- Capture cost, latency, and token-usage distributions
Layers of integration testing¶
| Layer | What's Tested |
|---|---|
| RAG end-to-end | Retrieval + generation together; faithfulness + answer relevance |
| Agent end-to-end | Full multi-step workflows; trace assertions |
| Tool integration | Real tool backends (or VCR cassettes); schema; auth |
| Guardrail layer | Configured policies enforce; benign queries not over-blocked |
| Cross-cutting | Latency budgets, cost budgets, error handling |
Deliverables¶
- Nightly eval report with metrics + deltas vs previous run
- Failure analysis: which categories regressed; which cases newly fail
- Reproducibility metadata: model + prompt + dataset versions, run timestamps
5. Stage 4 — Pre-Release Hardening¶
The last gate before production. Adversarial. Comprehensive. Slower and more expensive than the developer loop.
Activities¶
- Red-team / adversarial sweep — automated (DeepEval Red Teamer, Garak, PyRIT, Promptfoo red-team, AgentDojo) plus optional human red team
- Performance & load testing — p95/p99 latency at expected concurrency; cost at scale
- Safety gate — refuse-correctness across categorised adversarial corpora at the configured threshold (typically 100% on critical)
- Compliance evidence pack — for regulated environments, produce versioned, signed evidence
- Bias and fairness sweep — counterfactual templates across protected attributes
The release gate¶
| Gate | Pass Criterion |
|---|---|
| Eval suite | All metrics ≥ baseline – tolerance; no critical regressions |
| Safety | 100% pass on critical categories (jailbreak, PII, harmful content) |
| Performance | p95 latency ≤ SLA; cost per task within budget |
| Adversarial | Red-team findings tier-mapped; no open Critical / High |
| Compliance | Evidence pack complete and signed (for regulated work) |
Deliverables¶
- Eval report
- Adversarial test report
- Performance test report
- Safety report
- (Regulated) Validation pack mapping IQ / OQ / PQ or equivalent
6. Stage 5 — Release (Canary / Shadow / A/B)¶
Production deployment is part of testing, not the end of it.
Pattern: Shadow mode¶
Run the new version in parallel with the current production version. Both see real traffic. Only the production version's output reaches the user. Compare metrics on identical inputs.
| Use when... | Strength | Watch-out |
|---|---|---|
| Risk is high and offline eval may not generalise | Real traffic, no user impact | Doubles inference cost |
Pattern: Canary release¶
Roll the new version out to a small percentage of traffic (1% → 5% → 25% → 100%) over hours or days, with metric gates between stages.
| Use when... | Strength | Watch-out |
|---|---|---|
| Want real user-impact data with limited blast radius | Catches issues offline eval misses | Requires fast metrics + automated rollback |
Pattern: A/B test¶
Split traffic between control and treatment; measure business KPIs (deflection, CSAT, retention).
| Use when... | Strength | Watch-out |
|---|---|---|
| Final go/no-go decision for a customer-facing change | Direct business signal | Needs traffic + statistical power; can take weeks |
Pattern: Feature flag-gated release¶
The change ships dark; flag enables it per cohort. Decouples deploy from release.
Deliverables¶
- Rollout plan with go/no-go criteria per stage
- Automated rollback triggers (metric thresholds, error rates, user feedback signals)
- Stage-by-stage observability
7. Stage 6 — Production Monitoring¶
Continuous. Where issues real users see actually live.
What to monitor¶
| Signal Category | Examples |
|---|---|
| Quality | Faithfulness sampled on production traffic, refusal rate, hallucination probes |
| Performance | Latency distribution, throughput, error rates |
| Cost | Tokens per request, cost per task, model-tier usage |
| User signals | Thumbs-up/down, regenerate clicks, conversation length, escalation rate |
| Drift | Input distribution shift (embedding-based or feature-based) |
| Safety | Refusal rate vs baseline; flagged content rate |
| Operational | API rate-limit hits, retry rate, tool-failure rate |
Continuous evaluation patterns¶
- Sampled re-evaluation — periodically pull 100–1000 production requests, run them through the eval suite, alert on score drift
- Online LLM-as-judge — judge a small percent of live traffic and surface low-scoring conversations for review
- User-feedback loop — every thumbs-down enters a triage queue; recurring patterns become eval cases
- Incident → regression — every production incident becomes a permanent test case in the suite
Tooling¶
| Layer | Tool |
|---|---|
| Tracing | Langfuse, LangSmith, Arize Phoenix, Helicone |
| Alerting | Datadog, Grafana, native cloud monitoring |
| Drift detection | Arize, Evidently, custom |
| Feedback loop | LangSmith / Braintrust / in-house annotation |
Deliverables¶
- Live dashboards (engineering + product views)
- Alert configuration with documented thresholds
- Weekly / monthly quality reports
- Incident postmortems linked to regression tests
8. Process Matrix — Activity by Stage¶
Quick reference: which activities live in which stage.
| Activity | Spec | Dev | Integ | Pre-Rel | Release | Prod |
|---|---|---|---|---|---|---|
| Define metrics & thresholds | ✓ | |||||
| Build golden dataset | ✓ | ✓ | ✓ | ✓ | ||
| Author unit-style evals | ✓ | |||||
| Run full eval suite | ✓ | ✓ | sampled | |||
| RAG retrieval eval | ✓ | ✓ | ✓ | sampled | ||
| Agent trace assertions | ✓ | ✓ | ✓ | sampled | ||
| Red-team automated | (smoke) | ✓ | (periodic) | |||
| Red-team human | ✓ | (per release) | ||||
| Bias / fairness sweep | ✓ | (periodic) | ||||
| Performance / load | ✓ | ✓ | ✓ | |||
| Cost analysis | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Safety gate | (basic) | ✓ | ✓ | ✓ | ✓ | |
| A/B testing | ✓ | ✓ | ||||
| Shadow mode | ✓ | |||||
| Canary release | ✓ | |||||
| Drift detection | ✓ | |||||
| User-feedback loop | ✓ | |||||
| Evidence pack | (snapshot) | ✓ | ✓ | (refresh) |
9. Failure-Loop Matrix — Where Issues Get Caught (or Don't)¶
| Failure | Caught at Spec | Caught at Dev | Caught at Integ | Caught at Pre-Rel | Caught at Prod |
|---|---|---|---|---|---|
| Wrong requirement | ✓ | (late) | (late) | (late) | $$$$ |
| Prompt regression | ✓ | ✓ | ✓ | $$ | |
| Retrieval quality drop | (if tested) | ✓ | ✓ | $$ | |
| Tool integration bug | ✓ | ✓ | ✓ | $$ | |
| Latency creep | ✓ | ✓ | $$$ | ||
| Cost blow-up | (limited) | ✓ | ✓ | $$$$ | |
| Prompt injection (direct) | ✓ | ✓✓ | $$$ | ||
| Prompt injection (indirect) | (if tested) | ✓✓ | $$$$ | ||
| Harmful content | ✓ | ✓✓ | $$$$ | ||
| Bias | (limited) | ✓ | $$$$ | ||
| Drift | (snapshot) | ✓ | |||
| User-experience issue | (limited) | (limited) | ✓ |
$ = relative cost of catching it at that stage. Right-shift = expensive.
Interview line: "The shift-left case for AI eval is the same as any other QA — except the right-shift cost is higher because incidents in AI are public, viral, and often regulatory."
10. CI/CD Integration Patterns¶
Where each test lives in the pipeline¶
| Pipeline Stage | Tests Run | Latency Budget |
|---|---|---|
| Pre-commit hook | Schema validation, lightweight unit evals (5–10 cases) | < 30s |
| PR / merge to feature branch | Full unit evals on changed feature (~50 cases) | < 5 min |
| Merge to main | Full unit + integration suite + smoke red-team | < 20 min |
| Nightly | Full eval suite, full red-team, performance benchmark | < 4 hours |
| Pre-release branch | Everything above + manual red-team + bias sweep | hours–days |
| Production | Continuous sampled re-eval + drift monitoring | streaming |
Sample GitHub Actions skeleton¶
```yaml name: ai-eval on: [pull_request, schedule] jobs: unit-eval: if: github.event_name == 'pull_request' runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - run: pip install -r requirements.txt - run: pytest tests/eval/unit -x --tb=short
full-eval: if: github.event_name == 'schedule' runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - run: pip install -r requirements.txt - run: pytest tests/eval/integration - run: deepeval test run tests/red_team - uses: actions/upload-artifact@v4 with: name: eval-report path: reports/ ```
11. Lifecycle Q&A — Interview-Ready¶
-
Q: Where does most of your testing effort go in an AI project? Unit-level evals during development (fast loop) and continuous production monitoring (real signal). Pre-release red-team is essential but bounded; the daily compound effect comes from the fast loop and the production loop.
-
Q: How do you keep eval costs under control? Tiered judges — cheap judge for the dev loop, stronger judge for nightly + pre-release. Sample production rather than evaluating every request. Cache eval results against versioned model+prompt+dataset hashes. Batch where possible.
-
Q: How do you handle the fact that eval datasets get stale? Treat the dataset as living. Every production incident enters the suite. Quarterly review removes stale cases. Drift signals from prod monitoring trigger new case authoring. The dataset is versioned; older versions stay queryable for historical comparison.
-
Q: When would you choose canary over shadow? Shadow when the cost of being wrong is high and we need real-traffic comparison data without user impact. Canary when we're confident enough to expose users but want bounded blast radius and progressive confidence build-up.
-
Q: What goes into a release decision for an AI feature? Eval suite green vs baseline + tolerance, safety gates at 100% on critical, performance & cost within budget, red-team findings tier-mapped with no open Critical/High, documented residual-risk waiver if any, rollback plan rehearsed.
-
Q: How do you operationalise "every incident becomes a regression test"? Every prod incident triages into either a metric to add, a test case to add, or a coverage gap to fill. The postmortem includes a "regression test ID" as a required field — without it the postmortem isn't closed.
-
Q: How do you handle a model upgrade in production? Full offline eval first; investigate metric deltas beyond tolerance. Then shadow mode for a comparable load period. Then canary at 1% → 5% → 25% → 100% with metric gates between stages. Automated rollback on any safety metric regression or user-feedback delta.
12. Cross-References¶
- Metrics deep dive & matrices →
llm-agent-evaluation-matrix.md - RAG eval specifics →
ragas-faq.md - pytest-style framework →
deepeval-faq.md - Red-team theory →
red-blue-purple-team-ai-faq.md - MCP testing process →
mcp-testing-roadmap.md - MCP protocol →
mcp-servers-faq.md - Platform context →
enterprise-llm-platforms.md - Tool landscape →
commercial-llm-mcp-testing-tools.md - Architecture context →
rag-vs-agents-vs-agentic-rag.md
13. Master Interview Sound-Bites¶
- "Testing an AI feature isn't one phase — it's six. Spec, dev, integration, pre-release, release, production. Each has its own owners, tooling, and evidence. Treating them as one undifferentiated 'test phase' is the most common immaturity sign."
- "Production isn't where testing ends — it's where the highest-signal testing starts. Sampled re-eval, drift detection, and user-feedback loops catch what offline eval can't see."
- "Every production incident becomes a permanent regression test. The postmortem isn't closed without the test ID. That's how the suite stays alive."
- "Shadow for high-risk parity. Canary for confidence-building rollout. A/B for the final business decision. Choose deliberately — they answer different questions."
- "The cost of catching an issue right-shifts brutally for AI. A spec-stage fix is hours; a production-stage incident is a regulator notification."