LLM Testing Lifecycle — Pre-Prod to Production¶

A complete view of the LLM/AI testing lifecycle: where each activity fits, what's tested at each stage, what tooling supports it, and what evidence comes out. Designed as a process companion to the metric-heavy llm-agent-evaluation-matrix.md.

1. The Lifecycle at a Glance¶

flowchart LR
    S1[📐 Requirements<br/>& Eval Spec] --> S2[🛠️ Develop<br/>& Unit Eval]
    S2 --> S3[🔬 Integration<br/>System Eval]
    S3 --> S4[🛡️ Pre-Release<br/>Hardening]
    S4 --> S5[🚀 Release<br/>Canary / Shadow / A/B]
    S5 --> S6[📊 Production<br/>Monitoring]
    S6 -.->|Incident → Test| S2
    S6 -.->|Drift detected| S3

    style S1 fill:#e3f2fd,stroke:#1976d2,color:#000
    style S2 fill:#e8f5e9,stroke:#2e7d32,color:#000
    style S3 fill:#fff3e0,stroke:#f57c00,color:#000
    style S4 fill:#fce4ec,stroke:#c2185b,color:#000
    style S5 fill:#f3e5f5,stroke:#7b1fa2,color:#000
    style S6 fill:#fff8e1,stroke:#fbc02d,color:#000

Each stage has its own owners, tooling, and evidence artefacts. Treating them as separate steps — not one giant "test" phase — is the discipline that separates mature AI engineering from demo-grade work.

2. Stage 1 — Requirements & Eval Specification¶

Before anyone writes a prompt. What does this feature have to do, and how will we know?

Activities¶

Capture functional requirements (what should the AI do)
Capture non-functional AI requirements (latency, cost, safety, refusal behaviour, citation behaviour)
Define the eval spec: what metrics, what dataset, what thresholds, what gates
Map requirements to OWASP LLM Top 10 + NIST AI RMF risk categories
Identify regulatory exposure (EU AI Act, GxP, DORA, FCA, GDPR)

Deliverables¶

Artefact	Purpose
Feature spec	What the AI does, in user terms
Eval spec	What we'll measure, against what, at what threshold
Risk map	Threat model + regulatory tags
Initial golden set definition	What categories of test cases we'll build (not the cases themselves yet)

Common mistakes here¶

Writing the prompt before the eval spec — leads to tests that just re-encode current behaviour
Skipping non-functional requirements — discovering you need < 2s p95 latency after picking a model that runs at 5s
Treating "good enough" as a feeling — the threshold must be a number derived from data

3. Stage 2 — Develop & Unit-Level Evaluation¶

Engineer-loop. Fast, cheap, runs on every change.

Activities¶

Build the feature: prompt + retrieval + tools + guardrails
Author golden-set cases covering happy path + edge cases (start small, ~20–50 cases per feature)
Wire unit-style evals into the developer loop — fast (< 1 min ideally), cheap, runnable on demand
Treat prompts as versioned code

Tooling¶

Layer	Tool
Test runner	pytest + DeepEval, Promptfoo CLI
Metrics	DeepEval / Ragas / custom
Judge model	A cheap-but-strong model (e.g. gpt-4o-mini class) — calibrated against humans
Trace / debug	Langfuse / LangSmith / local logging

What "unit-level" means for LLM systems¶

Not "unit testing the LLM" — that's nonsense. It means: - Per-prompt assertions — a specific prompt + input → expected behaviour property - Per-component assertions — retriever returns expected chunks, guardrail blocks expected input - Schema conformance — output validates against pydantic / JSON Schema

Anti-patterns¶

Golden set written after the prompt (tests encode current behaviour, not requirements)
All cases happy-path (real bugs are at edges)
LLM-as-judge with no human calibration sample
Tests that take 30 minutes to run (engineers stop running them)

4. Stage 3 — Integration & System Evaluation¶

The system as a whole, on realistic inputs. Run nightly or on merge to main.

Activities¶

Run the full eval suite against a representative dataset (100s–1000s of cases)
Compare metrics against baseline; alert on deltas
Include cross-component scenarios (RAG + agent + tools + guardrails together)
Capture cost, latency, and token-usage distributions

Layers of integration testing¶

Layer	What's Tested
RAG end-to-end	Retrieval + generation together; faithfulness + answer relevance
Agent end-to-end	Full multi-step workflows; trace assertions
Tool integration	Real tool backends (or VCR cassettes); schema; auth
Guardrail layer	Configured policies enforce; benign queries not over-blocked
Cross-cutting	Latency budgets, cost budgets, error handling

Deliverables¶

Nightly eval report with metrics + deltas vs previous run
Failure analysis: which categories regressed; which cases newly fail
Reproducibility metadata: model + prompt + dataset versions, run timestamps

5. Stage 4 — Pre-Release Hardening¶

The last gate before production. Adversarial. Comprehensive. Slower and more expensive than the developer loop.

Activities¶

Red-team / adversarial sweep — automated (DeepEval Red Teamer, Garak, PyRIT, Promptfoo red-team, AgentDojo) plus optional human red team
Performance & load testing — p95/p99 latency at expected concurrency; cost at scale
Safety gate — refuse-correctness across categorised adversarial corpora at the configured threshold (typically 100% on critical)
Compliance evidence pack — for regulated environments, produce versioned, signed evidence
Bias and fairness sweep — counterfactual templates across protected attributes

The release gate¶

Gate	Pass Criterion
Eval suite	All metrics ≥ baseline – tolerance; no critical regressions
Safety	100% pass on critical categories (jailbreak, PII, harmful content)
Performance	p95 latency ≤ SLA; cost per task within budget
Adversarial	Red-team findings tier-mapped; no open Critical / High
Compliance	Evidence pack complete and signed (for regulated work)

Deliverables¶

Eval report
Adversarial test report
Performance test report
Safety report
(Regulated) Validation pack mapping IQ / OQ / PQ or equivalent

6. Stage 5 — Release (Canary / Shadow / A/B)¶

Production deployment is part of testing, not the end of it.

Pattern: Shadow mode¶

Run the new version in parallel with the current production version. Both see real traffic. Only the production version's output reaches the user. Compare metrics on identical inputs.

Use when...	Strength	Watch-out
Risk is high and offline eval may not generalise	Real traffic, no user impact	Doubles inference cost

Pattern: Canary release¶

Roll the new version out to a small percentage of traffic (1% → 5% → 25% → 100%) over hours or days, with metric gates between stages.

Use when...	Strength	Watch-out
Want real user-impact data with limited blast radius	Catches issues offline eval misses	Requires fast metrics + automated rollback

Pattern: A/B test¶

Split traffic between control and treatment; measure business KPIs (deflection, CSAT, retention).

Use when...	Strength	Watch-out
Final go/no-go decision for a customer-facing change	Direct business signal	Needs traffic + statistical power; can take weeks

Pattern: Feature flag-gated release¶

The change ships dark; flag enables it per cohort. Decouples deploy from release.

Deliverables¶

Rollout plan with go/no-go criteria per stage
Automated rollback triggers (metric thresholds, error rates, user feedback signals)
Stage-by-stage observability

7. Stage 6 — Production Monitoring¶

Continuous. Where issues real users see actually live.

What to monitor¶

Signal Category	Examples
Quality	Faithfulness sampled on production traffic, refusal rate, hallucination probes
Performance	Latency distribution, throughput, error rates
Cost	Tokens per request, cost per task, model-tier usage
User signals	Thumbs-up/down, regenerate clicks, conversation length, escalation rate
Drift	Input distribution shift (embedding-based or feature-based)
Safety	Refusal rate vs baseline; flagged content rate
Operational	API rate-limit hits, retry rate, tool-failure rate

Continuous evaluation patterns¶

Sampled re-evaluation — periodically pull 100–1000 production requests, run them through the eval suite, alert on score drift
Online LLM-as-judge — judge a small percent of live traffic and surface low-scoring conversations for review
User-feedback loop — every thumbs-down enters a triage queue; recurring patterns become eval cases
Incident → regression — every production incident becomes a permanent test case in the suite

Tooling¶

Layer	Tool
Tracing	Langfuse, LangSmith, Arize Phoenix, Helicone
Alerting	Datadog, Grafana, native cloud monitoring
Drift detection	Arize, Evidently, custom
Feedback loop	LangSmith / Braintrust / in-house annotation

Deliverables¶

Live dashboards (engineering + product views)
Alert configuration with documented thresholds
Weekly / monthly quality reports
Incident postmortems linked to regression tests

8. Process Matrix — Activity by Stage¶

Quick reference: which activities live in which stage.

Activity	Spec	Dev	Integ	Pre-Rel	Release	Prod
Define metrics & thresholds	✓
Build golden dataset	✓	✓	✓			✓
Author unit-style evals		✓
Run full eval suite			✓	✓		sampled
RAG retrieval eval		✓	✓	✓		sampled
Agent trace assertions		✓	✓	✓		sampled
Red-team automated			(smoke)	✓		(periodic)
Red-team human				✓		(per release)
Bias / fairness sweep				✓		(periodic)
Performance / load				✓	✓	✓
Cost analysis		✓	✓	✓	✓	✓
Safety gate		(basic)	✓	✓	✓	✓
A/B testing					✓	✓
Shadow mode					✓
Canary release					✓
Drift detection						✓
User-feedback loop						✓
Evidence pack			(snapshot)	✓	✓	(refresh)

9. Failure-Loop Matrix — Where Issues Get Caught (or Don't)¶

Failure	Caught at Spec	Caught at Dev	Caught at Integ	Caught at Pre-Rel	Caught at Prod
Wrong requirement	✓	(late)	(late)	(late)	$$$$
Prompt regression		✓	✓	✓	$$
Retrieval quality drop		(if tested)	✓	✓	$$
Tool integration bug		✓	✓	✓	$$
Latency creep			✓	✓	$$$
Cost blow-up		(limited)	✓	✓	$$$$
Prompt injection (direct)			✓	✓✓	$$$
Prompt injection (indirect)			(if tested)	✓✓	$$$$
Harmful content			✓	✓✓	$$$$
Bias			(limited)	✓	$$$$
Drift				(snapshot)	✓
User-experience issue			(limited)	(limited)	✓

$ = relative cost of catching it at that stage. Right-shift = expensive.

Interview line: "The shift-left case for AI eval is the same as any other QA — except the right-shift cost is higher because incidents in AI are public, viral, and often regulatory."

10. CI/CD Integration Patterns¶

Where each test lives in the pipeline¶

Pipeline Stage	Tests Run	Latency Budget
Pre-commit hook	Schema validation, lightweight unit evals (5–10 cases)	< 30s
PR / merge to feature branch	Full unit evals on changed feature (~50 cases)	< 5 min
Merge to main	Full unit + integration suite + smoke red-team	< 20 min
Nightly	Full eval suite, full red-team, performance benchmark	< 4 hours
Pre-release branch	Everything above + manual red-team + bias sweep	hours–days
Production	Continuous sampled re-eval + drift monitoring	streaming

Sample GitHub Actions skeleton¶

```yaml name: ai-eval on: [pull_request, schedule] jobs: unit-eval: if: github.event_name == 'pull_request' runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - run: pip install -r requirements.txt - run: pytest tests/eval/unit -x --tb=short

full-eval: if: github.event_name == 'schedule' runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - run: pip install -r requirements.txt - run: pytest tests/eval/integration - run: deepeval test run tests/red_team - uses: actions/upload-artifact@v4 with: name: eval-report path: reports/ ```

11. Lifecycle Q&A — Interview-Ready¶

Q: Where does most of your testing effort go in an AI project? Unit-level evals during development (fast loop) and continuous production monitoring (real signal). Pre-release red-team is essential but bounded; the daily compound effect comes from the fast loop and the production loop.
Q: How do you keep eval costs under control? Tiered judges — cheap judge for the dev loop, stronger judge for nightly + pre-release. Sample production rather than evaluating every request. Cache eval results against versioned model+prompt+dataset hashes. Batch where possible.
Q: How do you handle the fact that eval datasets get stale? Treat the dataset as living. Every production incident enters the suite. Quarterly review removes stale cases. Drift signals from prod monitoring trigger new case authoring. The dataset is versioned; older versions stay queryable for historical comparison.
Q: When would you choose canary over shadow? Shadow when the cost of being wrong is high and we need real-traffic comparison data without user impact. Canary when we're confident enough to expose users but want bounded blast radius and progressive confidence build-up.
Q: What goes into a release decision for an AI feature? Eval suite green vs baseline + tolerance, safety gates at 100% on critical, performance & cost within budget, red-team findings tier-mapped with no open Critical/High, documented residual-risk waiver if any, rollback plan rehearsed.
Q: How do you operationalise "every incident becomes a regression test"? Every prod incident triages into either a metric to add, a test case to add, or a coverage gap to fill. The postmortem includes a "regression test ID" as a required field — without it the postmortem isn't closed.
Q: How do you handle a model upgrade in production? Full offline eval first; investigate metric deltas beyond tolerance. Then shadow mode for a comparable load period. Then canary at 1% → 5% → 25% → 100% with metric gates between stages. Automated rollback on any safety metric regression or user-feedback delta.

12. Cross-References¶

Metrics deep dive & matrices → llm-agent-evaluation-matrix.md
RAG eval specifics → ragas-faq.md
pytest-style framework → deepeval-faq.md
Red-team theory → red-blue-purple-team-ai-faq.md
MCP testing process → mcp-testing-roadmap.md
MCP protocol → mcp-servers-faq.md
Platform context → enterprise-llm-platforms.md
Tool landscape → commercial-llm-mcp-testing-tools.md
Architecture context → rag-vs-agents-vs-agentic-rag.md

13. Master Interview Sound-Bites¶

"Testing an AI feature isn't one phase — it's six. Spec, dev, integration, pre-release, release, production. Each has its own owners, tooling, and evidence. Treating them as one undifferentiated 'test phase' is the most common immaturity sign."
"Production isn't where testing ends — it's where the highest-signal testing starts. Sampled re-eval, drift detection, and user-feedback loops catch what offline eval can't see."
"Every production incident becomes a permanent regression test. The postmortem isn't closed without the test ID. That's how the suite stays alive."
"Shadow for high-risk parity. Canary for confidence-building rollout. A/B for the final business decision. Choose deliberately — they answer different questions."
"The cost of catching an issue right-shifts brutally for AI. A spec-stage fix is hours; a production-stage incident is a regulator notification."