LLM Testing Lifecycle — Pre-Prod to Production

A complete view of the LLM/AI testing lifecycle: where each activity fits, what's tested at each stage, what tooling supports it, and what evidence comes out. Designed as a process companion to the metric-heavy llm-agent-evaluation-matrix.md.


1. The Lifecycle at a Glance

flowchart LR
    S1[📐 Requirements<br/>& Eval Spec] --> S2[🛠️ Develop<br/>& Unit Eval]
    S2 --> S3[🔬 Integration<br/>System Eval]
    S3 --> S4[🛡️ Pre-Release<br/>Hardening]
    S4 --> S5[🚀 Release<br/>Canary / Shadow / A/B]
    S5 --> S6[📊 Production<br/>Monitoring]
    S6 -.->|Incident → Test| S2
    S6 -.->|Drift detected| S3

    style S1 fill:#e3f2fd,stroke:#1976d2,color:#000
    style S2 fill:#e8f5e9,stroke:#2e7d32,color:#000
    style S3 fill:#fff3e0,stroke:#f57c00,color:#000
    style S4 fill:#fce4ec,stroke:#c2185b,color:#000
    style S5 fill:#f3e5f5,stroke:#7b1fa2,color:#000
    style S6 fill:#fff8e1,stroke:#fbc02d,color:#000

Each stage has its own owners, tooling, and evidence artefacts. Treating them as separate steps — not one giant "test" phase — is the discipline that separates mature AI engineering from demo-grade work.


2. Stage 1 — Requirements & Eval Specification

Before anyone writes a prompt. What does this feature have to do, and how will we know?

Activities

  • Capture functional requirements (what should the AI do)
  • Capture non-functional AI requirements (latency, cost, safety, refusal behaviour, citation behaviour)
  • Define the eval spec: what metrics, what dataset, what thresholds, what gates
  • Map requirements to OWASP LLM Top 10 + NIST AI RMF risk categories
  • Identify regulatory exposure (EU AI Act, GxP, DORA, FCA, GDPR)

Deliverables

| Artefact | Purpose |
| --- | --- |
| Feature spec | What the AI does, in user terms |
| Eval spec | What we'll measure, against what, at what threshold |
| Risk map | Threat model + regulatory tags |
| Initial golden set definition | What categories of test cases we'll build (not the cases themselves yet) |
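
One way to keep the eval spec honest is to write it down as data rather than prose. The following is a minimal sketch, not a prescribed format: `MetricGate`, `EvalSpec`, the feature name, the dataset tag, the risk tags, and the threshold values are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricGate:
    """One metric plus the numeric threshold it must meet."""
    metric: str
    threshold: float
    higher_is_better: bool = True
    blocking: bool = True  # a non-blocking gate is skipped by failing_gates()

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


@dataclass(frozen=True)
class EvalSpec:
    """Hypothetical eval spec: what is measured, on what data, at what threshold."""
    feature: str
    dataset_version: str
    case_categories: tuple[str, ...]
    gates: tuple[MetricGate, ...]
    risk_tags: tuple[str, ...] = ()

    def failing_gates(self, results: dict[str, float]) -> list[str]:
        """Return blocking metrics that are missing or miss their threshold."""
        failed = []
        for gate in self.gates:
            value = results.get(gate.metric)
            if gate.blocking and (value is None or not gate.passes(value)):
                failed.append(gate.metric)
        return failed


spec = EvalSpec(
    feature="support-answer-bot",   # illustrative name
    dataset_version="golden-v0",    # illustrative version tag
    case_categories=("happy_path", "edge_case", "refusal"),
    gates=(
        MetricGate("faithfulness", 0.85),
        MetricGate("p95_latency_s", 2.0, higher_is_better=False),
    ),
    risk_tags=("OWASP-LLM01", "GDPR"),  # illustrative tags
)

print(spec.failing_gates({"faithfulness": 0.90, "p95_latency_s": 2.4}))
```

Because a missing metric counts as a failure, a spec written before the prompt exists fails loudly until the measurement is actually wired up.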

Common mistakes here

  • Writing the prompt before the eval spec — leads to tests that just re-encode current behaviour
  • Skipping non-functional requirements — discovering you need < 2s p95 latency after picking a model that runs at 5s
  • Treating "good enough" as a feeling — the threshold must be a number derived from data

3. Stage 2 — Develop & Unit-Level Evaluation

Engineer-loop. Fast, cheap, runs on every change.

Activities

  • Build the feature: prompt + retrieval + tools + guardrails
  • Author golden-set cases covering happy path + edge cases (start small, ~20–50 cases per feature)
  • Wire unit-style evals into the developer loop — fast (< 1 min ideally), cheap, runnable on demand
  • Treat prompts as versioned code

Tooling

| Layer | Tool |
| --- | --- |
| Test runner | pytest + DeepEval, Promptfoo CLI |
| Metrics | DeepEval / Ragas / custom |
| Judge model | A cheap-but-strong model (e.g. gpt-4o-mini class), calibrated against humans |
| Trace / debug | Langfuse / LangSmith / local logging |

What "unit-level" means for LLM systems

Not "unit testing the LLM", which is nonsense. It means:

  • Per-prompt assertions: a specific prompt + input → an expected behaviour property
  • Per-component assertions: the retriever returns the expected chunks, the guardrail blocks the expected input
  • Schema conformance: the output validates against pydantic / JSON Schema
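
A minimal pytest-style sketch of two of these assertion types follows. It assumes a hypothetical `generate_answer` function standing in for the real model call, and an illustrative three-field output shape. The schema check uses only the standard library to stay dependency-free; pydantic or JSON Schema would play the same role.

```python
import json


def generate_answer(question: str) -> str:
    """Stand-in for the real model call (hypothetical); returns a JSON string."""
    return json.dumps({
        "answer": "Refunds are processed within 5 business days.",
        "citations": ["policy/refunds.md"],
        "refused": False,
    })


# Assumed output shape: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list, "refused": bool}


def schema_errors(raw: str) -> list[str]:
    """Return schema violations; an empty list means the output conforms."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            errors.append(f"wrong type for {name}")
    return errors


def test_output_conforms_to_schema():
    assert schema_errors(generate_answer("How long do refunds take?")) == []


def test_answer_cites_a_source():
    # Behaviour property, not an exact-string match:
    # a non-refused answer must cite at least one source.
    payload = json.loads(generate_answer("How long do refunds take?"))
    assert payload["refused"] or len(payload["citations"]) > 0
```

Asserting on properties (valid shape, has a citation) rather than exact wording keeps the tests stable across harmless rephrasings.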

Anti-patterns

  • Golden set written after the prompt (tests encode current behaviour, not requirements)
  • All cases happy-path (real bugs are at edges)
  • LLM-as-judge with no human calibration sample
  • Tests that take 30 minutes to run (engineers stop running them)
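
The judge-calibration anti-pattern can be checked with a small agreement statistic. The sketch below computes Cohen's kappa between human labels and judge labels on a calibration sample; the labels are made up for illustration, and what counts as acceptable agreement is a team decision, not something this code decides.

```python
def cohen_kappa(human: list[bool], judge: list[bool]) -> float:
    """Cohen's kappa for two raters giving binary pass/fail labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Agreement expected by chance if the two raters labelled independently.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return 1.0  # both raters gave the same constant label
    return (observed - expected) / (1 - expected)


# Illustrative calibration sample: True = "pass", False = "fail".
human_labels = [True, True, True, False, False, True, False, True, True, False]
judge_labels = [True, True, False, False, False, True, False, True, True, True]

print(round(cohen_kappa(human_labels, judge_labels), 2))
```

Kappa corrects raw agreement for chance, which matters when most cases pass: a judge that always says "pass" can show high raw agreement while carrying no signal.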

4. Stage 3 — Integration & System Evaluation

The system as a whole, on realistic inputs. Run nightly or on merge to main.

Activities

  • Run the full eval suite against a representative dataset (100s–1000s of cases)
  • Compare metrics against baseline; alert on deltas
  • Include cross-component scenarios (RAG + agent + tools + guardrails together)
  • Capture cost, latency, and token-usage distributions
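
The baseline comparison can be as simple as a per-metric delta check. This is a sketch under assumptions: the metric names, baseline values, and tolerances are invented, and `find_regressions` is a hypothetical helper rather than part of any tool named in this document.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: dict[str, float],
    lower_is_better: frozenset = frozenset(),
) -> dict[str, float]:
    """Return {metric: delta} for metrics that got worse by more than their tolerance."""
    regressions = {}
    for metric, base in baseline.items():
        if metric not in current:
            regressions[metric] = float("nan")  # a missing metric is treated as a regression
            continue
        delta = current[metric] - base
        # For latency or cost an increase is the bad direction; for scores a decrease is.
        worsening = delta if metric in lower_is_better else -delta
        if worsening > tolerance.get(metric, 0.0):
            regressions[metric] = delta
    return regressions


baseline = {"faithfulness": 0.90, "answer_relevance": 0.88, "p95_latency_s": 1.8}
current = {"faithfulness": 0.86, "answer_relevance": 0.875, "p95_latency_s": 1.9}
tolerance = {"faithfulness": 0.02, "answer_relevance": 0.02, "p95_latency_s": 0.2}

regressed = find_regressions(
    baseline, current, tolerance, lower_is_better=frozenset({"p95_latency_s"})
)
print(sorted(regressed))
```

In this example only `faithfulness` is flagged: it dropped by about 0.04 against a tolerance of 0.02, while the other two metrics moved within tolerance.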

Layers of integration testing

| Layer | What's Tested |
| --- | --- |
| RAG end-to-end | Retrieval + generation together; faithfulness + answer relevance |
| Agent end-to-end | Full multi-step workflows; trace assertions |
| Tool integration | Real tool backends (or VCR cassettes); schema; auth |
| Guardrail layer | Configured policies enforce; benign queries not over-blocked |
| Cross-cutting | Latency budgets, cost budgets, error handling |

Deliverables

  • Nightly eval report with metrics + deltas vs previous run
  • Failure analysis: which categories regressed; which cases newly fail
  • Reproducibility metadata: model + prompt + dataset versions, run timestamps
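
Reproducibility metadata is easier to compare when it collapses into a single fingerprint. The sketch below hashes the model name, the prompt text, and the dataset version into one stable ID. The field names and the `build_manifest` helper are assumptions for illustration; the same fingerprint could also serve as a cache key for eval results.

```python
import hashlib
import json
from datetime import datetime, timezone


def run_fingerprint(model: str, prompt_text: str, dataset_version: str) -> str:
    """Stable SHA-256 over the inputs that determine an eval result."""
    payload = json.dumps(
        {
            "model": model,
            "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
            "dataset": dataset_version,
        },
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_manifest(model: str, prompt_text: str, dataset_version: str) -> dict:
    """Metadata to store next to each eval report."""
    return {
        "fingerprint": run_fingerprint(model, prompt_text, dataset_version),
        "model": model,
        "dataset_version": dataset_version,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }


manifest = build_manifest("example-model-v1", "You are a helpful assistant.", "golden-v0")
print(manifest["fingerprint"][:12], manifest["started_at"])
```

The timestamp is deliberately kept out of the fingerprint, so two runs with identical inputs share an ID even though they ran at different times.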

5. Stage 4 — Pre-Release Hardening

The last gate before production. Adversarial. Comprehensive. Slower and more expensive than the developer loop.

Activities

  • Red-team / adversarial sweep — automated (DeepEval Red Teamer, Garak, PyRIT, Promptfoo red-team, AgentDojo) plus optional human red team
  • Performance & load testing — p95/p99 latency at expected concurrency; cost at scale
  • Safety gate — refuse-correctness across categorised adversarial corpora at the configured threshold (typically 100% on critical)
  • Compliance evidence pack — for regulated environments, produce versioned, signed evidence
  • Bias and fairness sweep — counterfactual templates across protected attributes
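
For the bias and fairness sweep, counterfactual templates can be generated mechanically. The following is a sketch only: the template, the attribute values, and the idea of scoring each variant with a single scalar are illustrative assumptions, and a real sweep would use a reviewed, domain-specific attribute list.

```python
from itertools import product

TEMPLATE = "{name} is a {age}-year-old applicant asking for a loan extension. Draft a reply."

# Illustrative attribute values only.
ATTRIBUTES = {
    "name": ["Emily", "Mohammed", "Wei"],
    "age": ["25", "67"],
}


def counterfactual_prompts(template: str, attributes: dict) -> list:
    """Return (attribute_values, prompt) pairs for every attribute combination."""
    keys = sorted(attributes)
    variants = []
    for combo in product(*(attributes[k] for k in keys)):
        values = dict(zip(keys, combo))
        variants.append((values, template.format(**values)))
    return variants


def max_score_gap(scores: list) -> float:
    """Largest difference in an outcome score across counterfactual variants."""
    return max(scores) - min(scores)


variants = counterfactual_prompts(TEMPLATE, ATTRIBUTES)
print(len(variants))
print(variants[0][1])
```

Each variant differs only in the protected attributes, so a large `max_score_gap` across the responses points at the attribute rather than at the task.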

The release gate

| Gate | Pass Criterion |
| --- | --- |
| Eval suite | All metrics ≥ baseline – tolerance; no critical regressions |
| Safety | 100% pass on critical categories (jailbreak, PII, harmful content) |
| Performance | p95 latency ≤ SLA; cost per task within budget |
| Adversarial | Red-team findings tier-mapped; no open Critical / High |
| Compliance | Evidence pack complete and signed (for regulated work) |
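
Three of the gates above reduce to simple checks once the inputs exist. This sketch assumes latency samples in seconds, one boolean per critical safety case, and a list of severity labels for open red-team findings. The function names and the nearest-rank percentile method are choices made for the example, not a standard mandated by this process.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def release_gate(
    latencies_s: list,
    sla_p95_s: float,
    critical_safety_results: list,
    open_findings: list,
) -> dict:
    """Return {gate_name: passed} for the performance, safety, and adversarial gates."""
    return {
        "performance": percentile(latencies_s, 95) <= sla_p95_s,
        "safety": all(critical_safety_results),  # 100% pass on critical categories
        "adversarial": not any(sev in {"Critical", "High"} for sev in open_findings),
    }


latencies = [0.8] * 18 + [1.9, 3.2]  # illustrative samples, in seconds
decision = release_gate(
    latencies,
    sla_p95_s=2.0,
    critical_safety_results=[True, True, True],
    open_findings=["Medium", "Low"],
)
print(decision, "ship" if all(decision.values()) else "hold")
```

Note that the 3.2 s outlier does not fail the p95 gate with 20 samples; a p99 check, or more samples, would be needed to catch it.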

Deliverables

  • Eval report
  • Adversarial test report
  • Performance test report
  • Safety report
  • (Regulated) Validation pack mapping IQ / OQ / PQ or equivalent

6. Stage 5 — Release (Canary / Shadow / A/B)

Production deployment is part of testing, not the end of it.

Pattern: Shadow mode

Run the new version in parallel with the current production version. Both see real traffic. Only the production version's output reaches the user. Compare metrics on identical inputs.

| Use when... | Strength | Watch-out |
| --- | --- | --- |
| Risk is high and offline eval may not generalise | Real traffic, no user impact | Doubles inference cost |
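
The core of shadow mode fits in a few lines. This is a simplified, synchronous sketch: `production` and `candidate` are hypothetical callables, and a real deployment would most likely run the candidate off the request path so it cannot add user-facing latency.

```python
from typing import Callable


def handle_request(
    user_input: str,
    production: Callable[[str], str],
    candidate: Callable[[str], str],
    log: list,
) -> str:
    """Serve the production answer; record the candidate's answer for comparison."""
    served = production(user_input)
    try:
        shadow, error = candidate(user_input), None
    except Exception as exc:  # a candidate failure must never affect the user
        shadow, error = None, repr(exc)
    log.append({
        "input": user_input,
        "production": served,
        "candidate": shadow,
        "candidate_error": error,
    })
    return served  # only the production output reaches the user


def prod_model(text: str) -> str:  # stand-in for the current version
    return f"v1 answer to: {text}"


def new_model(text: str) -> str:   # stand-in for the version under test
    return f"v2 answer to: {text}"


shadow_log = []
reply = handle_request("reset my password", prod_model, new_model, shadow_log)
print(reply)
print(shadow_log[0]["candidate"])
```

The log holds both outputs for identical inputs, which is what makes the later metric comparison a paired one.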

Pattern: Canary release

Roll the new version out to a small percentage of traffic (1% → 5% → 25% → 100%) over hours or days, with metric gates between stages.

| Use when... | Strength | Watch-out |
| --- | --- | --- |
| Want real user-impact data with limited blast radius | Catches issues offline eval misses | Requires fast metrics + automated rollback |
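
The staged rollout with metric gates can be sketched as a loop that stops at the first failed gate. The thresholds, the metric names, and the `observed` lookup are invented for the example; in practice `metrics_for_stage` would shift traffic, wait for enough data, and then read real metrics.

```python
from typing import Callable

STAGES = (1, 5, 25, 100)  # percent of traffic, matching the stages described above


def gate(metrics: dict) -> bool:
    """Hypothetical metric gate; thresholds are illustrative, not recommendations."""
    return metrics["error_rate"] <= 0.02 and metrics["thumbs_down_rate"] <= 0.05


def run_canary(
    stages: tuple,
    metrics_for_stage: Callable[[int], dict],
    gate_fn: Callable[[dict], bool],
) -> dict:
    """Advance stage by stage; report a rollback at the first failed gate."""
    for pct in stages:
        metrics = metrics_for_stage(pct)
        if not gate_fn(metrics):
            return {"status": "rolled_back", "failed_at_pct": pct}
    return {"status": "fully_rolled_out", "failed_at_pct": None}


# Made-up metrics per stage; the error rate breaches the gate at 25%.
observed = {
    1: {"error_rate": 0.01, "thumbs_down_rate": 0.03},
    5: {"error_rate": 0.01, "thumbs_down_rate": 0.04},
    25: {"error_rate": 0.04, "thumbs_down_rate": 0.04},
    100: {"error_rate": 0.01, "thumbs_down_rate": 0.03},
}

result = run_canary(STAGES, observed.__getitem__, gate)
print(result)
```

Here the rollout stops at 25% and never reaches 100%, which is the bounded blast radius the pattern is meant to provide.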

Pattern: A/B test

Split traffic between control and treatment; measure business KPIs (deflection, CSAT, retention).

| Use when... | Strength | Watch-out |
| --- | --- | --- |
| Final go/no-go decision for a customer-facing change | Direct business signal | Needs traffic + statistical power; can take weeks |
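
The "statistical power" watch-out is easier to reason about with a concrete test. The sketch below runs a two-sided two-proportion z-test on a rate-style KPI such as deflection. The counts are made up, and choosing this particular test is an assumption for illustration; other KPIs may call for a different analysis.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple:
    """Two-sided z-test for a difference between two proportions; returns (z, p_value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts: control deflects 400 of 1000 conversations, treatment 450 of 1000.
z, p = two_proportion_z_test(400, 1000, 450, 1000)
print(f"z={z:.2f}, p={p:.3f}")
```

A five-point lift on 1000 users per arm sits close to the conventional 0.05 significance line, which illustrates why smaller lifts or lower traffic can stretch an A/B test into weeks.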

Pattern: Feature flag-gated release

The change ships dark; flag enables it per cohort. Decouples deploy from release.
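
A common way to enable a flag per cohort is deterministic hash bucketing. This sketch is an assumption about how such a flag might be implemented and is not tied to any specific feature-flag product; `in_rollout` and the flag name are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, rollout_pct: float) -> bool:
    """Deterministically map a user to a bucket in [0, 100) and compare with the rollout."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100  # 0.00 .. 99.99
    return bucket < rollout_pct


users = [f"user-{i}" for i in range(10_000)]
enabled = sum(in_rollout(u, "new-prompt-v2", 25) for u in users)
print(f"{enabled / len(users):.1%} of users see the new version")
```

Because the bucket depends only on the flag name and the user ID, a user keeps the same experience across requests, and raising the percentage only ever adds users to the cohort.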

Deliverables

  • Rollout plan with go/no-go criteria per stage
  • Automated rollback triggers (metric thresholds, error rates, user feedback signals)
  • Stage-by-stage observability

7. Stage 6 — Production Monitoring

Continuous. This is where the issues real users actually see show up.

What to monitor

| Signal Category | Examples |
| --- | --- |
| Quality | Faithfulness sampled on production traffic, refusal rate, hallucination probes |
| Performance | Latency distribution, throughput, error rates |
| Cost | Tokens per request, cost per task, model-tier usage |
| User signals | Thumbs-up/down, regenerate clicks, conversation length, escalation rate |
| Drift | Input distribution shift (embedding-based or feature-based) |
| Safety | Refusal rate vs baseline; flagged content rate |
| Operational | API rate-limit hits, retry rate, tool-failure rate |
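
For feature-based drift, one widely used statistic is the Population Stability Index (PSI). The sketch below applies it to a single numeric feature such as input length. The binning scheme and the example data are assumptions, and any alert threshold should be tuned on real traffic rather than taken from this example.

```python
from math import log


def psi(reference: list, live: list, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `live` against `reference` for one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def shares(values: list) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values into edge bins
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)

    ref, cur = shares(reference), shares(live)
    return sum((c - r) * log(c / r) for r, c in zip(ref, cur))


reference_lengths = [float(v) for v in range(100)]    # illustrative: input lengths at launch
live_lengths = [v + 30.0 for v in reference_lengths]  # illustrative: inputs got longer

print(round(psi(reference_lengths, reference_lengths), 3))
print(psi(reference_lengths, live_lengths) > 0.25)
```

An identical distribution scores zero, and the shifted one scores well above the 0.25 level often quoted as a rule of thumb for a significant shift.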

Continuous evaluation patterns

  • Sampled re-evaluation — periodically pull 100–1000 production requests, run them through the eval suite, alert on score drift
  • Online LLM-as-judge — judge a small percent of live traffic and surface low-scoring conversations for review
  • User-feedback loop — every thumbs-down enters a triage queue; recurring patterns become eval cases
  • Incident → regression — every production incident becomes a permanent test case in the suite
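
Sampled re-evaluation, the first pattern in the list above, can be sketched as: draw a reproducible sample, score it, and compare the mean with a stored baseline. `score_fn` is a hypothetical stand-in for the eval suite, and the sample size and allowed drop are illustrative.

```python
import random
from statistics import mean
from typing import Callable


def sampled_reeval(
    requests: list,
    score_fn: Callable[[str], float],
    baseline_mean: float,
    sample_size: int,
    max_drop: float,
    seed: int = 0,
) -> dict:
    """Score a random sample of production requests and flag a drop against the baseline."""
    if not requests:
        raise ValueError("no production requests to sample")
    rng = random.Random(seed)  # seeded so the same sample can be re-drawn for debugging
    sample = rng.sample(requests, min(sample_size, len(requests)))
    current = mean(score_fn(r) for r in sample)
    return {
        "sampled": len(sample),
        "mean_score": current,
        "alert": (baseline_mean - current) > max_drop,
    }


def fake_score(request: str) -> float:
    """Stand-in scorer: pretends long requests are answered less faithfully."""
    return 0.6 if len(request) > 40 else 0.9


traffic = ["short question"] * 300 + ["a much longer and more complicated production question"] * 700
report = sampled_reeval(traffic, fake_score, baseline_mean=0.88, sample_size=200, max_drop=0.05)
print(report)
```

Since most of the made-up traffic is long, the sampled mean lands well below the baseline and the alert fires; in a real setup such a result would feed the triage queue.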

Tooling

| Layer | Tool |
| --- | --- |
| Tracing | Langfuse, LangSmith, Arize Phoenix, Helicone |
| Alerting | Datadog, Grafana, native cloud monitoring |
| Drift detection | Arize, Evidently, custom |
| Feedback loop | LangSmith / Braintrust / in-house annotation |

Deliverables

  • Live dashboards (engineering + product views)
  • Alert configuration with documented thresholds
  • Weekly / monthly quality reports
  • Incident postmortems linked to regression tests

8. Process Matrix — Activity by Stage

Quick reference: which activities live in which stage.

Activity Spec Dev Integ Pre-Rel Release Prod
Define metrics & thresholds
Build golden dataset
Author unit-style evals
Run full eval suite sampled
RAG retrieval eval sampled
Agent trace assertions sampled
Red-team automated (smoke) (periodic)
Red-team human (per release)
Bias / fairness sweep (periodic)
Performance / load
Cost analysis
Safety gate (basic)
A/B testing
Shadow mode
Canary release
Drift detection
User-feedback loop
Evidence pack (snapshot) (refresh)

9. Failure-Loop Matrix — Where Issues Get Caught (or Don't)

Failure Caught at Spec Caught at Dev Caught at Integ Caught at Pre-Rel Caught at Prod
Wrong requirement (late) (late) (late) $$$$
Prompt regression $$
Retrieval quality drop (if tested) $$
Tool integration bug $$
Latency creep $$$
Cost blow-up (limited) $$$$
Prompt injection (direct) ✓✓ $$$
Prompt injection (indirect) (if tested) ✓✓ $$$$
Harmful content ✓✓ $$$$
Bias (limited) $$$$
Drift (snapshot)
User-experience issue (limited) (limited)

$ = relative cost of catching it at that stage. Right-shift = expensive.

Interview line: "The shift-left case for AI eval is the same as any other QA — except the right-shift cost is higher because incidents in AI are public, viral, and often regulatory."


10. CI/CD Integration Patterns

Where each test lives in the pipeline

| Pipeline Stage | Tests Run | Latency Budget |
| --- | --- | --- |
| Pre-commit hook | Schema validation, lightweight unit evals (5–10 cases) | < 30s |
| PR / merge to feature branch | Full unit evals on changed feature (~50 cases) | < 5 min |
| Merge to main | Full unit + integration suite + smoke red-team | < 20 min |
| Nightly | Full eval suite, full red-team, performance benchmark | < 4 hours |
| Pre-release branch | Everything above + manual red-team + bias sweep | hours–days |
| Production | Continuous sampled re-eval + drift monitoring | streaming |

Sample GitHub Actions skeleton

```yaml
name: ai-eval
on:
  pull_request:
  schedule:
    # A schedule trigger needs a cron expression; this nightly time is an assumption.
    - cron: "0 2 * * *"
jobs:
  # Fast loop: unit-level evals on every pull request.
  unit-eval:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/eval/unit -x --tb=short

  # Slow loop: full integration suite plus red-team on the schedule.
  full-eval:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/eval/integration
      - run: deepeval test run tests/red_team
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: reports/
```


11. Lifecycle Q&A — Interview-Ready

  • Q: Where does most of your testing effort go in an AI project? Unit-level evals during development (fast loop) and continuous production monitoring (real signal). Pre-release red-team is essential but bounded; the daily compound effect comes from the fast loop and the production loop.

  • Q: How do you keep eval costs under control? Tiered judges — cheap judge for the dev loop, stronger judge for nightly + pre-release. Sample production rather than evaluating every request. Cache eval results against versioned model+prompt+dataset hashes. Batch where possible.

  • Q: How do you handle the fact that eval datasets get stale? Treat the dataset as living. Every production incident enters the suite. Quarterly review removes stale cases. Drift signals from prod monitoring trigger new case authoring. The dataset is versioned; older versions stay queryable for historical comparison.

  • Q: When would you choose canary over shadow? Shadow when the cost of being wrong is high and we need real-traffic comparison data without user impact. Canary when we're confident enough to expose users but want bounded blast radius and progressive confidence build-up.

  • Q: What goes into a release decision for an AI feature? Eval suite green against baseline within tolerance, safety gates at 100% on critical categories, performance and cost within budget, red-team findings tier-mapped with no open Critical/High, a documented residual-risk waiver if any risk is accepted, and a rehearsed rollback plan.

  • Q: How do you operationalise "every incident becomes a regression test"? Every prod incident triages into either a metric to add, a test case to add, or a coverage gap to fill. The postmortem includes a "regression test ID" as a required field — without it the postmortem isn't closed.

  • Q: How do you handle a model upgrade in production? Full offline eval first; investigate metric deltas beyond tolerance. Then shadow mode for a comparable load period. Then canary at 1% → 5% → 25% → 100% with metric gates between stages. Automated rollback on any safety metric regression or user-feedback delta.


12. Cross-References

  • Metrics deep dive & matrices: llm-agent-evaluation-matrix.md
  • RAG eval specifics: ragas-faq.md
  • pytest-style framework: deepeval-faq.md
  • Red-team theory: red-blue-purple-team-ai-faq.md
  • MCP testing process: mcp-testing-roadmap.md
  • MCP protocol: mcp-servers-faq.md
  • Platform context: enterprise-llm-platforms.md
  • Tool landscape: commercial-llm-mcp-testing-tools.md
  • Architecture context: rag-vs-agents-vs-agentic-rag.md

13. Master Interview Sound-Bites

  • "Testing an AI feature isn't one phase — it's six. Spec, dev, integration, pre-release, release, production. Each has its own owners, tooling, and evidence. Treating them as one undifferentiated 'test phase' is the most common immaturity sign."
  • "Production isn't where testing ends — it's where the highest-signal testing starts. Sampled re-eval, drift detection, and user-feedback loops catch what offline eval can't see."
  • "Every production incident becomes a permanent regression test. The postmortem isn't closed without the test ID. That's how the suite stays alive."
  • "Shadow for high-risk parity. Canary for confidence-building rollout. A/B for the final business decision. Choose deliberately — they answer different questions."
  • "The cost of catching an issue right-shifts brutally for AI. A spec-stage fix is hours; a production-stage incident is a regulator notification."