LLM & Agent Evaluation Matrix
The master reference for evaluating LLMs, RAG systems, and AI agents. Built as matrices — easy to scan, easy to deliver in an interview. Use this when you need to pick a metric, justify a metric, or explain why a metric isn't enough on its own.
1. The Evaluation Landscape — Three Layers
Most candidates conflate these three. Separating them is a fast signal of depth.
| Layer | What's Being Evaluated | Typical Metrics |
|---|---|---|
| Model evaluation | The model itself, in isolation | Benchmarks (MMLU, HumanEval, GSM8K, ARC), perplexity, calibration |
| System evaluation | The model + prompt + retrieval + tools + guardrails as a whole | Faithfulness, answer relevance, hallucination rate, refusal rate, citation accuracy |
| Product / outcome evaluation | Did the system help the user achieve their goal? | Task success rate, time-to-resolution, deflection rate, CSAT, retention |
Interview line: "Most failures live at the system layer. Model-level metrics tell you about a model in a vacuum; product-level metrics tell you after the fact. System-level metrics — faithfulness, hallucination, tool-correctness — are the ones that catch issues before they become incidents."
2. The Metric Universe — Organised by Purpose
2A. Reference-Based Metrics
Need a labelled "ground truth" output to compare against.
| Metric | What It Measures | Scale | When to Use | Gotcha |
|---|---|---|---|---|
| Exact Match | Literal string equality | 0/1 | Deterministic tasks (calculator, structured extraction) | Useless for natural-language outputs |
| F1 (token overlap) | Precision/recall on shared tokens | 0–1 | Q&A short answers | Misses paraphrasing |
| BLEU | n-gram overlap with reference, precision-weighted | 0–1 | Machine translation; legacy | Penalises legitimate paraphrasing |
| ROUGE (1 / 2 / L) | n-gram and longest-common-subsequence recall vs reference | 0–1 | Summarisation | Same paraphrase blindness |
| METEOR | Tokens + synonyms + stemming alignment | 0–1 | MT, summarisation; less brittle than BLEU | Slower; English-biased |
| BERTScore | Cosine similarity of contextual embeddings | 0–1 | Semantic similarity beyond surface form | Embeddings inherit model bias |
| Semantic similarity | Sentence-embedding cosine | 0–1 | Open-ended Q&A | Threshold setting is arbitrary |
| Edit distance (Levenshtein) | Min character edits | int | Code, structured outputs | Character-level only |
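As a rough illustration of the two cheapest rows in the table, the sketch below computes exact match and token-overlap F1. The normalisation (lowercasing, stripping punctuation) is an assumption borrowed from common short-answer Q&A practice, not something this document prescribes, and the function names are illustrative.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace (assumed scheme)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> int:
    """Literal string equality, ignoring only leading/trailing whitespace."""
    return int(prediction.strip() == reference.strip())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A verbose but correct answer such as "the capital is Paris" against the reference "Paris" fails exact match and gets an F1 of only 0.4, which is the gap the embedding-based rows are meant to close.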
2B. Reference-Free Metrics
No ground truth needed — judge the output against a criterion.
| Metric | What It Measures | Typical Implementation | Gotcha |
|---|---|---|---|
| Faithfulness | Are claims in the answer supported by the source context? | LLM-as-judge per claim | Judge bias; needs strong judge |
| Answer Relevance | Does the answer actually address the question? | LLM-as-judge or QA back-translation | Misses subtle off-topic responses |
| Groundedness | Output stays grounded in retrieved/provided context | Same as faithfulness; Azure ships a managed one | Same gotchas |
| Hallucination score | Complement of faithfulness (1 − faithfulness): share of unsupported claims | LLM-as-judge + entity overlap + citation check | No single signal is enough |
| Coherence | Logical / internal consistency of the answer | LLM-as-judge | Subjective; calibrate against human labels |
| Fluency | Grammar, readability | LLM-as-judge or classifier | Frontier models max this out — low signal |
| Toxicity | Harmful / offensive content | Detoxify, Perspective API, classifier | Cultural and contextual blind spots |
| Bias | Demographic / ideological skew | Counterfactual templates, StereoSet, BBQ | Coverage limited to tested attributes |
| Refusal correctness | System refuses when it should (and doesn't when it shouldn't) | Adversarial corpus + LLM-as-judge | Over-refusal is the silent killer |
| PII leakage | Output exposes personal data it shouldn't | Regex + NER + LLM-as-judge | Indirect references slip past regex |
| Calibration | Confidence aligns with actual accuracy | Brier score, ECE | Needs labelled outcomes; long-running |
| Citation validity | Citations exist and support the claim | Programmatic + judge | False citations score high on faithfulness if context contains them |
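The calibration row names Brier score and ECE; a minimal sketch of both follows, assuming each prediction comes with a confidence in [0, 1] and a binary labelled outcome (1 = correct). The equal-width binning and the default of 10 bins are assumptions; other binning schemes exist.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if not outcomes or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


def expected_calibration_error(
    confidences: list[float], outcomes: list[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if not outcomes or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, o))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / len(outcomes)) * abs(avg_conf - accuracy)
    return ece
```

A system that claims 0.9 confidence on four answers and gets three right has an ECE of about 0.15: it is overconfident by fifteen points in that bin.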
2C. Retrieval Metrics (RAG-specific)
| Metric | What It Measures | Range | Typical Threshold |
|---|---|---|---|
| Context Precision | Are the most-relevant chunks ranked highest? | 0–1 | > 0.7 |
| Context Recall | Did retrieval pull all needed chunks? | 0–1 | > 0.8 |
| Context Relevance | Are retrieved chunks actually relevant? | 0–1 | > 0.7 |
| MRR (Mean Reciprocal Rank) | Mean over queries of 1 / rank of the first relevant result | 0–1 | > 0.7 |
| nDCG | Ranking-aware quality of retrieval | 0–1 | > 0.8 |
| Hit Rate @ k | Was a relevant chunk in top-k? | 0/1 per query; 0–1 averaged over queries | > 0.9 @ k=5 |
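The ranking metrics in this table are simple enough to compute by hand, which helps when explaining them. The sketch below assumes each query yields an ordered list of chunk IDs plus a labelled set of relevant IDs (or graded gains for nDCG); the chunk IDs and function names are hypothetical.

```python
import math


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> int:
    """1 if any relevant chunk appears in the top k, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))


def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values: list[float]) -> float:
    """Average per-query scores into MRR or hit rate."""
    return sum(values) / len(values)
```

MRR and hit rate are the means of `reciprocal_rank` and `hit_at_k` over the query set, so a regression can be traced back to the individual queries that dropped.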
2D. Agent-Specific Metrics
| Metric | What It Measures | Implementation |
|---|---|---|
| Tool Correctness | Right tool selected? | Compare trace to expected tool set |
| Argument Correctness | Right arguments passed? | Schema + value assertions |
| Plan Quality | Are the steps sensible and minimal? | LLM-as-judge or human review |
| Task Completion | Did the agent finish the task? | Goal-state oracle; LLM-as-judge |
| Trajectory Efficiency | Steps / tokens / cost vs minimum needed | Trace metrics vs golden trajectory |
| Recovery Rate | When a tool fails, does the agent recover? | Inject failures, measure outcomes |
| Goal-state Accuracy | Did the final state match the requested state? | State diff |
| Latency / Cost per task | Time and tokens per completed task | Trace-level capture |
| Self-consistency | Same task ⇒ same outcome across runs | Multi-sample agreement |
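One simple way to put a number on the self-consistency row is the share of repeated runs that land on the most common outcome. This is a sketch under the assumption that each run can be reduced to a comparable outcome label; the labels in the usage note are hypothetical.

```python
from collections import Counter


def self_consistency(outcomes: list[str]) -> float:
    """Fraction of runs that agree with the most frequent outcome."""
    if not outcomes:
        raise ValueError("need at least one run")
    _, count = Counter(outcomes).most_common(1)[0]
    return count / len(outcomes)
```

Five runs of the same task ending in `["refund_issued"] * 4 + ["escalated"]` score 0.8. Whether that is acceptable depends on the task; the number only makes the variance visible.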
2E. Operational / Production Metrics
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Latency p50 / p95 / p99 | Response time distribution | UX; SLA |
| Token cost per request | Input + output tokens × price | Budget; ROI |
| Throughput / QPS | Requests served per second (sustained capacity) | Scale planning |
| Error rate | Share of requests hitting API / tool / guardrail failures | Reliability |
| Refusal rate | % of requests refused | Over-refusal detection |
| User feedback signals | Thumbs-up/down, regenerate clicks, conversation length | Real-world signal |
| Drift score | Distribution shift in inputs or outputs over time | Early warning |
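The latency and token-cost rows reduce to a percentile over a sample and a price-weighted sum. The sketch below uses the nearest-rank percentile definition, which is an assumption (monitoring tools often interpolate instead), and the per-1k-token prices in the usage note are placeholders, not real provider pricing.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Input and output tokens are priced separately, then summed."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
```

With ten samples where one request took 2500 ms, p50 is 220 ms while p95 and p99 are both 2500 ms; that tail is why the table lists three percentiles rather than a mean.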
3. The Decision Matrix — Picking the Right Metric
When asked "how would you evaluate X?", your structured answer comes from this matrix.
3A. By task type
| Task Type | Primary Metrics | Secondary / Supporting |
|---|---|---|
| Open-ended Q&A | Answer relevance, faithfulness, hallucination | Coherence, citation validity |
| RAG-based Q&A | All of the above + context precision/recall | Latency, cost |
| Summarisation | Faithfulness, coverage, conciseness | ROUGE-L for legacy comparison |
| Translation | BERTScore, COMET (modern) | BLEU/METEOR (legacy) |
| Code generation | Functional correctness (run tests), pass@k | Compile rate, complexity |
| Structured extraction | Field-level exact match, schema conformance | F1 per field, hallucination |
| Classification | Accuracy, macro-F1, calibration | Confusion matrix per class |
| Chat / multi-turn | Conversational relevance, knowledge retention, completeness | Latency, refusal correctness |
| Agent / tool use | Tool correctness, task completion | Trajectory efficiency, recovery rate |
| Safety / refusal | Refusal correctness, over-refusal rate | Bias, toxicity, jailbreak rate |
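The code-generation row mentions pass@k. A commonly used form is the unbiased estimator popularised alongside the HumanEval benchmark: generate n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k drawn samples passes. The sketch below covers a single problem; averaging across problems is left out.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples with c passing."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples and 3 passing, pass@1 is 0.3, which matches the raw pass rate; larger k rewards a model that is right at least occasionally.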
3B. By evaluation method
| Method | When to Use | Strengths | Weaknesses |
|---|---|---|---|
| Deterministic check (exact match, schema, regex) | Output has structure | Fast, free, reliable | Misses paraphrase |
| Lexical metric (BLEU/ROUGE) | Legacy comparison; cheap baseline | Cheap, well-known | Paraphrase blindness |
| Embedding similarity | Semantic comparison, no labels | Cheap, captures paraphrase | Threshold is arbitrary |
| LLM-as-judge | Subjective qualities, no ground truth | Captures nuance | Cost, bias, non-determinism |
| Multi-judge consensus | High-stakes assessment | Reduces single-judge bias | Higher cost |
| Human evaluation | Final calibration, edge cases | Gold standard | Slow, expensive, doesn't scale |
| Programmatic check (citations exist, length budget) | Behavioural assertions | Cheap, reliable | Limited coverage |
| A/B test in production | Real-world signal | Captures user impact | Slow; needs traffic; product-level only |
3C. By stage of development
| Stage | Primary Eval | Tooling |
|---|---|---|
| Pre-development / spec | Define metrics + thresholds before building | Spreadsheet / Notion |
| Development | Unit-style eval on golden datasets | DeepEval / pytest |
| Integration | End-to-end on representative inputs | Ragas / DeepEval / custom |
| Pre-release | Full eval suite + adversarial / red-team | DeepEval Red Teamer, PyRIT, Garak |
| Post-release | Monitoring + drift + user feedback | Langfuse / LangSmith / Arize |
| Continuous improvement | A/B testing, fine-tune iteration | LangSmith / Braintrust / custom |
4. Quality Attribute Matrix — Mapping ISO/IEC 25010 to AI Systems
Auditors love this framing. Most AI candidates don't have it.
| Attribute | Traditional Software | AI System Equivalent |
|---|---|---|
| Functional suitability | Feature works to spec | Accuracy, completeness, appropriateness — measured by eval suite |
| Reliability | MTBF, fault tolerance | Consistency across runs, refusal correctness, recovery from tool failure |
| Performance efficiency | Latency, throughput | Latency p50/p95/p99, tokens-per-request, cost-per-task |
| Compatibility | Co-existence, interoperability | Model/API version compatibility, MCP conformance |
| Usability | UX, accessibility | Coherence, fluency, helpfulness, refusal phrasing |
| Security | Confidentiality, integrity, authenticity | Prompt-injection resilience, PII leakage, auth boundary enforcement |
| Maintainability | Modularity, modifiability | Prompt versioning, eval reproducibility, regression coverage |
| Portability | Adaptability, installability | Provider abstraction (LiteLLM), model-swap resilience |
Interview line: "I map AI quality work onto the same ISO 25010 attributes auditors use for traditional software — it gives the audit team a frame they recognise and ensures we cover security and reliability, not just functional correctness."
5. Risk → Test-Category Matrix
The most useful matrix for designing a test programme. Maps known failure classes to coverage.
| Risk | Test Category | Example Assertion |
|---|---|---|
| Hallucination | Faithfulness eval, citation validity | Faithfulness > 0.85 on golden set |
| Stale knowledge | RAG retrieval refresh; date-bound queries | Answers for post-cutoff dates use retrieval |
| Prompt injection (direct) | Adversarial input corpus | System does not follow "ignore previous instructions" patterns |
| Prompt injection (indirect) | Poisoned-document corpus in RAG | Embedded instructions in retrieved docs not executed |
| Jailbreak | Categorised jailbreak corpus | Model refuses across role-play / hypothetical / encoded variants |
| Bias | Counterfactual templates per protected attribute | Output behaviour invariant across attribute |
| Toxicity | Adversarial generations targeting harm | Toxicity classifier < threshold |
| PII leakage | Probes attempting data extraction | No PII in output to unauthorised user |
| Authority escalation | Agent permission tests | Agent refuses tool requiring elevated scope |
| Tool misuse | Trace-level assertions on tool selection | Right tool, right args, right order |
| Cascading hallucination | Multi-step trace eval | Step N output supported by step N-1 result |
| Latency amplification | Trace latency budget per workflow | p95 < N seconds end-to-end |
| Cost amplification | Token budget per workflow | Tokens per task < N |
| Drift | Continuous golden-set rerun | Metric delta vs baseline < tolerance |
| Refusal regression | Benign-query refusal rate | Over-refusal < 2% on benign queries |
| Schema drift | Output schema conformance | Pydantic validation passes 100% |
6. LLM-as-Judge — The Critical Sub-Skill
LLM-as-judge appears in almost every metric above. Treat it as a system you test, not a tool you trust.
Judge calibration checklist
| Step | Why It Matters |
|---|---|
| Pick a strong judge (≥ GPT-4 class) | Weak judges introduce noise indistinguishable from real signal |
| Use a different model than the generator | Same-model judging inflates scores |
| Validate the judge against human labels on a sample | Without this you don't know how reliable the score is |
| Calibrate the rubric — explicit criteria, not vibes | Vague rubrics = unstable scores |
| Pin temperature = 0 for the judge | Reduces run-to-run noise |
| Use chain-of-thought before the verdict | Improves judge accuracy on subtle cases |
| Multi-judge consensus for high-stakes assessment | Three judges + majority vote beats one |
| Re-validate when models update | Judges drift too |
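For the "validate the judge against human labels" step, one option is a chance-corrected agreement statistic such as Cohen's kappa. That choice is an assumption, not something this checklist mandates; raw agreement or correlation are alternatives. The sketch assumes the judge and a human reviewer labelled the same sample with categorical verdicts.

```python
from collections import Counter


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Agreement between judge and human, corrected for agreement expected by chance."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need equal-length, non-empty label lists")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Chance agreement: both raters independently pick the same label.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A judge that matches the human on 5 of 6 items can still land at a kappa of roughly 0.67 once chance agreement is removed, which is a more honest number to report than 83% agreement.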
Judge anti-patterns to call out in interview
| Anti-pattern | Why It's Bad |
|---|---|
| Single judge, same model as generator | Inflated scores, no second opinion |
| Boolean verdict ("good / bad") with no reasoning | Lost signal; hard to debug |
| No human validation sample | Operating blind |
| Judge prompt mixed with generation prompt | Coupled changes, hard to debug regressions |
| Trusting the score over rerunning on a labelled subset | Confidence without calibration |
7. Framework Comparison Matrix
When asked "which framework would you use?", this is the structured answer.
7A. Open-source evaluation frameworks
| Framework | Style | Strength | Weakness | Best Fit |
|---|---|---|---|---|
| Ragas | Dataset-batch | RAG-metric coverage | Less ergonomic for behavioural tests | RAG-heavy projects |
| DeepEval | pytest-native | Broad — RAG, safety, agents, red-team | Cost adds up at scale | Teams wanting CI-integrated AI tests |
| TruLens | Feedback functions | Programmable evaluators | Smaller ecosystem | Custom metric needs |
| Promptfoo | YAML-config CLI | Easy CI integration + red-team | Less Python flexibility | Polyglot teams |
| OpenAI Evals | Eval registry | Standard format; community evals | OpenAI-leaning | Pre-existing OpenAI shop |
| LangChain Evaluators | Chain-native | Tight LangChain fit | Lock-in | LangChain projects |
| Arize Phoenix Evals | Template-driven | Pairs with tracing | Less standalone | Already-on-Arize teams |
7B. Observability / tracing platforms
| Platform | Open Source? | Tracing | Eval | Dataset Mgmt | Self-Host |
|---|---|---|---|---|---|
| Langfuse | Yes | ✓✓ | ✓ | ✓ | ✓ |
| LangSmith | No (cloud) | ✓✓ | ✓ | ✓ | ✗ |
| Arize Phoenix / AX | Phoenix yes; AX no | ✓✓ | ✓ | ✓ | Phoenix only |
| Braintrust | No | ✓ | ✓✓ | ✓✓ | ✗ |
| Helicone | Yes | ✓ | basic | ✗ | ✓ |
| W&B Weave | No | ✓ | ✓ | ✓ | ✗ |
| OpenTelemetry GenAI | Yes (spec) | ✓✓ | via integration | via integration | ✓ |
Interview line: "I'd anchor on an open-source primitive — Ragas or DeepEval for metrics, Langfuse for tracing — because eval logic belongs in version control. Commercial layers earn their place when dashboards, dataset versioning, or non-engineer collaboration become bottlenecks."
7C. Red-team & security frameworks
| Tool | Type | Strength |
|---|---|---|
| PyRIT | Microsoft Python framework | Build custom red-team campaigns |
| Garak | NVIDIA CLI scanner | 100+ probes; broad baseline |
| AgentDojo | Academic benchmark | Agent prompt-injection robustness |
| Promptfoo (red-team mode) | YAML + CLI | OWASP LLM Top 10 coverage |
| DeepEval Red Teamer | Python | 50+ vulnerability categories in code |
| Lakera Red | Commercial | Hosted, runtime + offline |
| Mindgard | Commercial | Continuous adversarial testing SaaS |
See red-blue-purple-team-ai-faq.md for theory; commercial-llm-mcp-testing-tools.md for the full vendor landscape.
8. Threshold-Setting Matrix
The "what number do I put in CI" question. There's no universal answer — but there's a defensible method.
| Approach | How It Works | When to Use |
|---|---|---|
| Baseline + safety margin | Measure current model + prompt on golden set; threshold = baseline minus tolerance | Most common; honest |
| Reference-best minus delta | Threshold relative to known-best (GPT-4 etc.) | When you're optimising for parity |
| Business-derived | Backwards from product KPI (e.g. "we need 90% deflection to break even") | Strong product ties |
| Adversarial absolute | Safety / refusal tests: must pass 100% on critical categories | Safety gates |
| Distribution-based | Pass if score distribution overlap with baseline > X | When point estimates are noisy |
| Human-validated calibration | Threshold corresponds to "would a reviewer accept this" rate | High-stakes content |
Anti-pattern: picking thresholds from intuition. "It should be > 0.8" — based on what? Always derive from data.
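A minimal sketch of the "baseline + safety margin" row, assuming the golden set has been run several times on the current model and prompt. Expressing the tolerance as k standard deviations of those runs is one possible choice, not a rule from this document; a fixed absolute tolerance would work the same way.

```python
from statistics import mean, stdev


def derive_threshold(baseline_scores: list[float], k: float = 2.0, floor: float = 0.0) -> float:
    """CI threshold = mean of baseline runs minus k standard deviations (assumed tolerance)."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline runs to estimate spread")
    return max(floor, mean(baseline_scores) - k * stdev(baseline_scores))
```

Five faithfulness runs of `[0.86, 0.88, 0.90, 0.88, 0.88]` give a threshold of about 0.852. The point is that the number in CI can be traced back to measured data.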
9. Tiering Matrix — Findings & Gates
Critical for Lead / senior roles. Without tiers, every issue is "P1" and nothing moves.
| Tier | Definition | Action | Examples |
|---|---|---|---|
| Critical | Safety, PII leakage, authority escalation, regulator-relevant policy violation | Blocks release | Jailbreak success, PII in output, prompt injection bypassing guardrail |
| High | Quality regression on a measured metric beyond tolerance | Waivable with explicit sign-off | Faithfulness drop > 5%, latency p95 > budget |
| Medium | Quality regression within tolerance; new edge case identified | Tracked; fix in next sprint | Marginal score drop, new niche failure mode |
| Low | Cosmetic / non-functional | Backlog | Phrasing preference, minor fluency drop |
Interview line: "Tiers aren't a vibe — they're a policy. Published, signed off, defended. Without them everything is critical and nothing actually moves."
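The tier policy in the table can be encoded as a release gate so the rule is executable rather than tribal knowledge. The sketch below follows the table (Critical always blocks; High blocks unless waived with sign-off; Medium and Low never block); the class and field names are illustrative, not taken from any framework.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Finding:
    title: str
    tier: Tier
    waived: bool = False  # True only when an explicit sign-off is recorded


def release_decision(findings: list[Finding]) -> tuple[bool, list[str]]:
    """Return (can_release, titles of blocking findings)."""
    blockers = []
    for finding in findings:
        if finding.tier is Tier.CRITICAL:
            blockers.append(finding.title)  # critical findings cannot be waived
        elif finding.tier is Tier.HIGH and not finding.waived:
            blockers.append(finding.title)
    return (not blockers, blockers)
```

Note that a waiver flag on a Critical finding is deliberately ignored: the table gives Critical findings no waiver path.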
10. Rapid-Fire Q&A — Interview-Ready
Conceptual
- Q: What's the single most important LLM metric? None. Every interesting evaluation stacks multiple signals. The honest answer if pressed: faithfulness for RAG, task completion for agents, refusal correctness for safety.
- Q: What's wrong with BLEU and ROUGE for LLM eval? Paraphrase blindness. They reward surface overlap with a reference, which penalises legitimately rephrased correct answers. Use BERTScore or LLM-as-judge for semantic comparison.
- Q: How do you evaluate without ground-truth labels? Reference-free metrics — faithfulness, answer relevance, coherence — using LLM-as-judge with a calibrated rubric, paired with deterministic checks (citations, schema, length).
- Q: How do you stop LLM-as-judge from being a flaky test source? Temperature zero on the judge, a different model from the generator, chain-of-thought in the rubric, multi-judge consensus for high stakes, periodic human calibration, version-pin the judge.
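Multi-judge consensus can be as small as a strict-majority vote over per-judge verdicts. The sketch below assumes each judge returns a categorical verdict string and treats "no strict majority" as a signal to escalate (for example to human review); that escalation policy is an assumption.

```python
from collections import Counter
from typing import Optional


def majority_verdict(verdicts: list[str]) -> Optional[str]:
    """Return the verdict backed by more than half the judges, else None."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count > len(verdicts) / 2 else None
```

Using an odd number of judges avoids ties on binary verdicts; a `None` result is worth logging, because persistent disagreement often points at a vague rubric.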
RAG-specific
- Q: What does Ragas measure that DeepEval doesn't, or vice versa? Mostly overlap. Ragas is dataset-batch ergonomic for RAG metric reporting. DeepEval is pytest-style and broader (safety, agents, red-team). Many teams use both.
- Q: Faithfulness vs answer relevance? Faithfulness = the answer is grounded in retrieved context (didn't make things up). Answer relevance = the answer is about the question. They fail independently.
- Q: Why measure retrieval and generation separately? Bad retrieval poisons even good generation. Good generation can still misuse retrieval. Localising regressions requires isolating each layer.
Agent-specific
- Q: How do you test an agent end-to-end? Trace-level assertions on the tool-call sequence (which tools, what args, what order), plus task-completion oracle, plus latency/cost budgets. The trace matters as much as the final answer.
- Q: What failure modes are unique to agentic systems? Tool misuse, cascading hallucination, latency amplification, cost amplification, authority creep, infinite reasoning loops, indirect injection through tool outputs.
- Q: How do you handle non-deterministic agent traces? Assert on properties not exact paths. Required tools must be present; arguments must satisfy schema; outcome must match goal-state. The exact path may vary.
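The "assert on properties, not exact paths" idea can be sketched as a checker that ignores call order and extra steps. The trace shape, tool names, and argument names below are hypothetical, and the "schema" is reduced to required argument names for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def check_trace(
    trace: list[ToolCall],
    required_tools: set[str],
    required_args: dict[str, set[str]],
    goal_state: dict,
    final_state: dict,
) -> list[str]:
    """Return property violations; an empty list means the trace passes."""
    violations = []
    called = {call.name for call in trace}
    for tool in sorted(required_tools - called):
        violations.append(f"missing required tool: {tool}")
    for call in trace:
        missing = required_args.get(call.name, set()) - set(call.args)
        for arg in sorted(missing):
            violations.append(f"{call.name} missing argument: {arg}")
    # Only keys named in goal_state are checked; extra state is allowed.
    for key, expected in goal_state.items():
        if final_state.get(key) != expected:
            violations.append(f"goal-state mismatch on {key}")
    return violations
```

Two runs that call the same tools in a different order, or with an extra lookup in between, both pass; a run that skips a required tool or ends in the wrong state does not.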
Process
- Q: How do you decide when an AI feature is "ready to ship"? Threshold-met on the eval suite, safety gates passed at 100%, performance and cost within budget, a documented residual-risk waiver if any, and a rollback plan if metrics drift in production.
- Q: How do you handle a model upgrade (e.g. GPT-4 → GPT-5)? Full eval suite rerun on the new model with the old prompts; investigate any metric delta beyond tolerance; A/B test in production on a percentage of traffic; only promote if business KPIs hold.
- Q: How do you build an eval golden set? Start from real user queries; add edge cases (ambiguous, multi-hop, no-answer-available, adversarial); aim for coverage of failure modes, not quantity; version the dataset alongside the model.
11. Common Anti-Patterns
| Anti-pattern | Why It's Bad | Fix |
|---|---|---|
| Single-metric eval ("faithfulness is 0.85, ship it") | One number masks failure modes | Stack 3+ signals; report per-category |
| Same model as judge and generator | Inflated scores | Use a different (stronger) judge |
| No human calibration of LLM-as-judge | Unknown reliability | Validate on a labelled sample |
| Thresholds picked from intuition | Indefensible at audit | Derive from baseline + safety margin |
| Eval suite that never fails | Suspect — likely undersized | Adversarial seeding; coverage map |
| Pass/fail only — no trend over time | Misses drift | Plot metrics; alert on deltas |
| Tests written after the prompt | Tests pass because they encode current behaviour | Write expected behaviour first |
| Mixing eval and observability | Confused signals | Eval = pre-release; observability = production; separate stores |
| Trusting one judge model | Single-point bias | Multi-judge or human spot checks |
12. Cross-References
- RAG metric deep dive → ragas-faq.md
- pytest-style framework → deepeval-faq.md
- Adversarial / red-team theory → red-blue-purple-team-ai-faq.md
- MCP-specific testing → mcp-servers-faq.md, mcp-testing-roadmap.md
- Architecture context → rag-vs-agents-vs-agentic-rag.md
- Platform-specific testing → enterprise-llm-platforms.md
- Tool landscape → commercial-llm-mcp-testing-tools.md
- Lifecycle (pre-prod → prod) → llm-testing-lifecycle.md
13. Master Interview Sound-Bites
- "Evaluation has three layers — model, system, product. Most failures live at the system layer; most metrics live at the model layer. Bridging that gap is where QE adds value."
- "No single metric is enough. I stack at least three signals — LLM-as-judge for nuance, deterministic checks for what can be checked, and citation validity for grounded outputs."
- "Thresholds aren't intuition — they're derived. Baseline plus safety margin; safety categories at 100%. Anything else is indefensible at audit."
- "I tier findings explicitly — critical blocks release, high is waivable with sign-off, medium is tracked. Without tiers everything is P1 and nothing moves."
- "For agents the trace matters as much as the answer. A right answer reached through the wrong tools is still a quality defect — cost, latency, audit."
- "LLM-as-judge is a system I test, not a tool I trust. Strong judge, different model from the generator, validated against humans, multi-judge for high stakes."