
LLM & Agent Evaluation Matrix

The master reference for evaluating LLMs, RAG systems, and AI agents. Built as matrices — easy to scan, easy to deliver in an interview. Use this when you need to pick a metric, justify a metric, or explain why a metric isn't enough on its own.


1. The Evaluation Landscape — Three Layers

Most candidates conflate these three. Separating them is a fast signal of depth.

Layer | What's Being Evaluated | Typical Metrics
Model evaluation | The model itself, in isolation | Benchmarks (MMLU, HumanEval, GSM8K, ARC), perplexity, calibration
System evaluation | The model + prompt + retrieval + tools + guardrails as a whole | Faithfulness, answer relevance, hallucination rate, refusal rate, citation accuracy
Product / outcome evaluation | Did the system help the user achieve their goal? | Task success rate, time-to-resolution, deflection rate, CSAT, retention

Interview line: "Most failures live at the system layer. Model-level metrics tell you about a model in a vacuum; product-level metrics tell you after the fact. System-level metrics — faithfulness, hallucination, tool-correctness — are the ones that catch issues before they become incidents."


2. The Metric Universe — Organised by Purpose

2A. Reference-Based Metrics

Need a labelled "ground truth" output to compare against.

Metric | What It Measures | Scale | When to Use | Gotcha
Exact Match | Literal string equality | 0/1 | Deterministic tasks (calculator, structured extraction) | Useless for natural-language outputs
F1 (token overlap) | Harmonic mean of precision and recall on shared tokens | 0–1 | Q&A short answers | Misses paraphrasing
BLEU | n-gram precision against a reference, with a brevity penalty | 0–1 | Machine translation; legacy | Penalises legitimate paraphrasing
ROUGE (1 / 2 / L) | n-gram and longest-common-subsequence recall vs reference | 0–1 | Summarisation | Same paraphrase blindness
METEOR | Tokens + synonyms + stemming alignment | 0–1 | MT, summarisation; less brittle than BLEU | Slower; English-biased
BERTScore | Cosine similarity of contextual embeddings | 0–1 | Semantic similarity beyond surface form | Embeddings inherit model bias
Semantic similarity | Sentence-embedding cosine | 0–1 | Open-ended Q&A | Threshold setting is arbitrary
Edit distance (Levenshtein) | Min character edits | int | Code, structured outputs | Character-level only
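As a rough illustration of the two cheapest reference-based metrics, the sketch below computes exact match and token-overlap F1. The function names and the normalisation (lowercase, strip punctuation, split on whitespace) are assumptions made for this example; published benchmarks define their own normalisation rules.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction, reference):
    """1 if the strings are identical after trimming whitespace, else 0."""
    return int(prediction.strip() == reference.strip())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, the prediction "Paris" against the reference "The capital is Paris" scores 0 on exact match and 0.4 on F1, which is the paraphrase problem the Gotcha column describes: a correct answer is penalised for wording.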

2B. Reference-Free Metrics

No ground truth needed — judge the output against a criterion.

Metric | What It Measures | Typical Implementation | Gotcha
Faithfulness | Are claims in the answer supported by the source context? | LLM-as-judge per claim | Judge bias; needs strong judge
Answer Relevance | Does the answer actually address the question? | LLM-as-judge or QA back-translation | Misses subtle off-topic responses
Groundedness | Output stays grounded in retrieved/provided context | Same as faithfulness; Azure ships a managed one | Same gotchas
Hallucination score | Inverse of faithfulness: share of unsupported claims | LLM-as-judge + entity overlap + citation check | No single signal is enough
Coherence | Logical / internal consistency of the answer | LLM-as-judge | Subjective; calibrate against human labels
Fluency | Grammar, readability | LLM-as-judge or classifier | Frontier models max this out, so it carries little signal
Toxicity | Harmful / offensive content | Detoxify, Perspective API, classifier | Cultural and contextual blind spots
Bias | Demographic / ideological skew | Counterfactual templates, StereoSet, BBQ | Coverage limited to tested attributes
Refusal correctness | System refuses when it should (and doesn't when it shouldn't) | Adversarial corpus + LLM-as-judge | Over-refusal is the silent killer
PII leakage | Output exposes personal data it shouldn't | Regex + NER + LLM-as-judge | Indirect references slip past regex
Calibration | Confidence aligns with actual accuracy | Brier score, ECE | Needs labelled outcomes; long-running
Citation validity | Citations exist and support the claim | Programmatic + judge | A claim with a wrong or fabricated citation can still score high on faithfulness when the context happens to support it
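The Calibration row names the Brier score and ECE (expected calibration error). A minimal sketch of both follows, assuming each prediction carries a confidence in [0, 1] and a binary outcome (1 = correct). The equal-width binning with ten bins is a common convention, not something the table prescribes.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((c, o))
    total = sum(len(bucket) for bucket in bins)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_confidence - accuracy)
    return ece
```

Lower is better for both. A system that says 0.9 and is right 75% of the time contributes a gap of 0.15 for that bin, which is the over-confidence signal the metric is meant to surface.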

2C. Retrieval Metrics (RAG-specific)

Metric | What It Measures | Range | Typical Threshold
Context Precision | Are the most-relevant chunks ranked highest? | 0–1 | > 0.7
Context Recall | Did retrieval pull all needed chunks? | 0–1 | > 0.8
Context Relevance | Are retrieved chunks actually relevant? | 0–1 | > 0.7
MRR (Mean Reciprocal Rank) | Reciprocal of the rank of the first relevant result, averaged over queries | 0–1 | > 0.7
nDCG | Ranking-aware quality of retrieval | 0–1 | > 0.8
Hit Rate @ k | Was a relevant chunk in top-k? | 0/1 per query | > 0.9 @ k=5
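The three ranking metrics in this table can be computed directly from a ranked list of chunk ids and a set of relevance labels. The sketch below uses hypothetical function names and the common log2 discount for nDCG; it is illustrative, not the implementation of any particular framework.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids, relevant_ids, k):
    """1 if any relevant chunk appears in the top k, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))


def ndcg_at_k(ranked_ids, gains, k):
    """nDCG with graded relevance; `gains` maps id -> gain, missing ids count as 0."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values):
    """Average per-query scores into MRR or hit rate."""
    values = list(values)
    return sum(values) / len(values)
```

MRR is then `mean(reciprocal_rank(r, rel) for r, rel in queries)`, and Hit Rate @ k is the same average over `hit_at_k`.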

2D. Agent-Specific Metrics

Metric | What It Measures | Implementation
Tool Correctness | Right tool selected? | Compare trace to expected tool set
Argument Correctness | Right arguments passed? | Schema + value assertions
Plan Quality | Are the steps sensible and minimal? | LLM-as-judge or human review
Task Completion | Did the agent finish the task? | Goal-state oracle; LLM-as-judge
Trajectory Efficiency | Steps / tokens / cost vs minimum needed | Trace metrics vs golden trajectory
Recovery Rate | When a tool fails, does the agent recover? | Inject failures, measure outcomes
Goal-state Accuracy | Did the final state match the requested state? | State diff
Latency / Cost per task | Time and tokens per completed task | Trace-level capture
Self-consistency | Same task ⇒ same outcome across runs | Multi-sample agreement
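Two rows in this table reduce to simple arithmetic once traces are captured. The sketch below shows one possible reading of Self-consistency (share of runs agreeing with the most common outcome) and Trajectory Efficiency (golden step count over actual step count). Both definitions and the function names are assumptions for illustration; teams often use tokens or cost instead of steps.

```python
from collections import Counter


def self_consistency(outcomes):
    """Share of runs whose outcome matches the most common outcome."""
    if not outcomes:
        raise ValueError("need at least one run")
    _, top_count = Counter(outcomes).most_common(1)[0]
    return top_count / len(outcomes)


def trajectory_efficiency(actual_steps, golden_steps):
    """Golden steps divided by actual steps, capped at 1.0 (no wasted steps)."""
    if actual_steps <= 0:
        raise ValueError("actual_steps must be positive")
    return min(1.0, golden_steps / actual_steps)
```

For example, five runs of the same task that end in "refund_issued" four times and "escalated" once give a self-consistency of 0.8, and an agent that takes six steps where the golden trajectory needs three scores 0.5 on efficiency.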

2E. Operational / Production Metrics

Metric | What It Measures | Why It Matters
Latency p50 / p95 / p99 | Response time distribution | UX; SLA
Token cost per request | Input + output tokens × price | Budget; ROI
Throughput / QPS | Concurrent capacity | Scale planning
Error rate | API / tool / guardrail failures per minute | Reliability
Refusal rate | % of requests refused | Over-refusal detection
User feedback signals | Thumbs-up/down, regenerate clicks, conversation length | Real-world signal
Drift score | Distribution shift in inputs or outputs over time | Early warning
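The first two rows can be sketched in a few lines. The percentile below uses the nearest-rank method (one of several conventions; monitoring tools may interpolate instead), and the per-1k-token prices passed to `request_cost` are placeholders, not real provider pricing.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def request_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Input and output tokens are priced separately, then summed."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


# Hypothetical latency samples in milliseconds.
latencies_ms = [120, 135, 150, 180, 210, 260, 340, 900, 1400, 2600]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Reporting p95 and p99 alongside p50 matters because a healthy median can hide a long tail, as the sample data here does.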

3. The Decision Matrix — Picking the Right Metric

When asked "how would you evaluate X?", your structured answer comes from this matrix.

3A. By task type

Task Type | Primary Metrics | Secondary / Supporting
Open-ended Q&A | Answer relevance, faithfulness, hallucination | Coherence, citation validity
RAG-based Q&A | The open-ended Q&A metrics + context precision/recall | Latency, cost
Summarisation | Faithfulness, coverage, conciseness | ROUGE-L for legacy comparison
Translation | BERTScore, COMET (modern) | BLEU/METEOR (legacy)
Code generation | Functional correctness (run tests), pass@k | Compile rate, complexity
Structured extraction | Field-level exact match, schema conformance | F1 per field, hallucination
Classification | Accuracy, macro-F1, calibration | Confusion matrix per class
Chat / multi-turn | Conversational relevance, knowledge retention, completeness | Latency, refusal correctness
Agent / tool use | Tool correctness, task completion | Trajectory efficiency, recovery rate
Safety / refusal | Refusal correctness, over-refusal rate | Bias, toxicity, jailbreak rate
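The pass@k metric in the code-generation row is usually computed with the unbiased estimator popularised alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k randomly drawn samples passes. A short sketch, with a hypothetical function name:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of `pass_at_k` over all problems. Sampling n larger than k and using this formula gives a lower-variance estimate than literally drawing k samples once.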

3B. By evaluation method

Method | When to Use | Strengths | Weaknesses
Deterministic check (exact match, schema, regex) | Output has structure | Fast, free, reliable | Misses paraphrase
Lexical metric (BLEU/ROUGE) | Legacy comparison; cheap baseline | Cheap, well-known | Paraphrase blindness
Embedding similarity | Semantic comparison, no labels | Cheap, captures paraphrase | Threshold is arbitrary
LLM-as-judge | Subjective qualities, no ground truth | Captures nuance | Cost, bias, non-determinism
Multi-judge consensus | High-stakes assessment | Reduces single-judge bias | Higher cost
Human evaluation | Final calibration, edge cases | Gold standard | Slow, expensive, doesn't scale
Programmatic check (citations exist, length budget) | Behavioural assertions | Cheap, reliable | Limited coverage
A/B test in production | Real-world signal | Captures user impact | Slow; needs traffic; product-level only

3C. By stage of development

Stage | Primary Eval | Tooling
Pre-development / spec | Define metrics + thresholds before building | Spreadsheet / Notion
Development | Unit-style eval on golden datasets | DeepEval / pytest
Integration | End-to-end on representative inputs | Ragas / DeepEval / custom
Pre-release | Full eval suite + adversarial / red-team | DeepEval Red Teamer, PyRIT, Garak
Post-release | Monitoring + drift + user feedback | Langfuse / LangSmith / Arize
Continuous improvement | A/B testing, fine-tune iteration | LangSmith / Braintrust / custom

4. Quality Attribute Matrix — Mapping ISO/IEC 25010 to AI Systems

Auditors love this framing. Most AI candidates don't have it.

Attribute | Traditional Software | AI System Equivalent
Functional suitability | Feature works to spec | Accuracy, completeness, appropriateness, as measured by the eval suite
Reliability | MTBF, fault tolerance | Consistency across runs, refusal correctness, recovery from tool failure
Performance efficiency | Latency, throughput | Latency p50/p95/p99, tokens-per-request, cost-per-task
Compatibility | Co-existence, interoperability | Model/API version compatibility, MCP conformance
Usability | UX, accessibility | Coherence, fluency, helpfulness, refusal phrasing
Security | Confidentiality, integrity, authenticity | Prompt-injection resilience, PII leakage, auth boundary enforcement
Maintainability | Modularity, modifiability | Prompt versioning, eval reproducibility, regression coverage
Portability | Adaptability, installability | Provider abstraction (LiteLLM), model-swap resilience

Interview line: "I map AI quality work onto the same ISO 25010 attributes auditors use for traditional software — it gives the audit team a frame they recognise and ensures we cover security and reliability, not just functional correctness."


5. Risk → Test-Category Matrix

The most useful matrix for designing a test programme. Maps known failure classes to coverage.

Risk | Test Category | Example Assertion
Hallucination | Faithfulness eval, citation validity | Faithfulness > 0.85 on golden set
Stale knowledge | RAG retrieval refresh; date-bound queries | Answers for post-cutoff dates use retrieval
Prompt injection (direct) | Adversarial input corpus | System ignores IGNORE PREVIOUS patterns
Prompt injection (indirect) | Poisoned-document corpus in RAG | Embedded instructions in retrieved docs not executed
Jailbreak | Categorised jailbreak corpus | Model refuses across role-play / hypothetical / encoded variants
Bias | Counterfactual templates per protected attribute | Output behaviour invariant across attribute
Toxicity | Adversarial generations targeting harm | Toxicity classifier < threshold
PII leakage | Probes attempting data extraction | No PII in output to unauthorised user
Authority escalation | Agent permission tests | Agent refuses tool requiring elevated scope
Tool misuse | Trace-level assertions on tool selection | Right tool, right args, right order
Cascading hallucination | Multi-step trace eval | Step N output supported by step N-1 result
Latency amplification | Trace latency budget per workflow | p95 < N seconds end-to-end
Cost amplification | Token budget per workflow | Tokens per task < N
Drift | Continuous golden-set rerun | Metric delta vs baseline < tolerance
Refusal regression | Benign-query refusal rate | Over-refusal < 2% on benign queries
Schema drift | Output schema conformance | Pydantic validation passes 100%
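The Drift row ("metric delta vs baseline < tolerance") is one of the easier assertions to automate. The sketch below compares a fresh golden-set run against a stored baseline and returns only the metrics that regressed. It assumes every metric is higher-is-better; latency or cost would need the sign flipped. The metric names and numbers are made up for illustration.

```python
def regressions(baseline, current, tolerance):
    """Return {metric: delta} for metrics that fell more than `tolerance` below baseline."""
    failed = {}
    for name, base_score in baseline.items():
        # A metric missing from the current run is treated as 0.0, so it fails loudly.
        delta = current.get(name, 0.0) - base_score
        if delta < -tolerance:
            failed[name] = round(delta, 4)
    return failed


# Hypothetical scores from two runs of the same golden set.
baseline_run = {"faithfulness": 0.90, "answer_relevance": 0.88}
nightly_run = {"faithfulness": 0.82, "answer_relevance": 0.87}
drifted = regressions(baseline_run, nightly_run, tolerance=0.03)
```

Here `drifted` contains only faithfulness, because the one-point dip in answer relevance sits inside the tolerance. In CI the gate would fail whenever the returned dict is non-empty.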

6. LLM-as-Judge — The Critical Sub-Skill

LLM-as-judge appears in almost every metric above. Treat it as a system you test, not a tool you trust.

Judge calibration checklist

Step | Why It Matters
Pick a strong judge (≥ GPT-4 class) | Weak judges introduce noise indistinguishable from real signal
Use a different model than the generator | Same-model judging inflates scores
Validate the judge against human labels on a sample | Without this you don't know how reliable the score is
Calibrate the rubric: explicit criteria, not vibes | Vague rubrics = unstable scores
Pin temperature = 0 for the judge | Reduces run-to-run noise
Use chain-of-thought before the verdict | Improves judge accuracy on subtle cases
Multi-judge consensus for high-stakes assessment | Three judges + majority vote beats one
Re-validate when models update | Judges drift too
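Two checklist steps lend themselves to a short sketch: validating the judge against human labels and taking a multi-judge majority vote. Cohen's kappa is used here as the agreement statistic because it corrects for chance agreement; that choice, and the function names, are assumptions of this example rather than something the checklist mandates.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Most common verdict across judges; None on an empty list or a tie."""
    counts = Counter(verdicts).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: escalate to a human reviewer
    return counts[0][0]


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between judge and human labels."""
    n = len(judge_labels)
    if n == 0 or n != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Raw agreement can look healthy when one label dominates, which is why kappa is worth reporting next to it: a judge that agrees with humans 75% of the time on a balanced-ish sample can still land below 0.5 kappa.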

Judge anti-patterns to call out in interview

Anti-pattern | Why It's Bad
Single judge, same model as generator | Inflated scores, no second opinion
Boolean verdict ("good / bad") with no reasoning | Lost signal; hard to debug
No human validation sample | Operating blind
Judge prompt mixed with generation prompt | Coupled changes, hard to debug regressions
Trusting the score over rerunning on a labelled subset | Confidence without calibration

7. Framework Comparison Matrix

When asked "which framework would you use?", this is the structured answer.

7A. Open-source evaluation frameworks

Framework | Style | Strength | Weakness | Best Fit
Ragas | Dataset-batch | RAG-metric coverage | Less ergonomic for behavioural tests | RAG-heavy projects
DeepEval | pytest-native | Broad: RAG, safety, agents, red-team | Cost adds up at scale | Teams wanting CI-integrated AI tests
TruLens | Feedback functions | Programmable evaluators | Smaller ecosystem | Custom metric needs
Promptfoo | YAML-config CLI | Easy CI integration + red-team | Less Python flexibility | Polyglot teams
OpenAI Evals | Eval registry | Standard format; community evals | OpenAI-leaning | Pre-existing OpenAI shop
LangChain Evaluators | Chain-native | Tight LangChain fit | Lock-in | LangChain projects
Arize Phoenix Evals | Template-driven | Pairs with tracing | Less standalone | Already-on-Arize teams

7B. Observability / tracing platforms

Platform Open Source? Tracing Eval Dataset Mgmt Self-Host
Langfuse Yes ✓✓
LangSmith No (cloud) ✓✓
Arize Phoenix / AX Phoenix yes; AX no ✓✓ Phoenix only
Braintrust No ✓✓ ✓✓
Helicone Yes basic
W&B Weave No
OpenTelemetry GenAI Yes (spec) ✓✓ via integration via integration

Interview line: "I'd anchor on an open-source primitive — Ragas or DeepEval for metrics, Langfuse for tracing — because eval logic belongs in version control. Commercial layers earn their place when dashboards, dataset versioning, or non-engineer collaboration become bottlenecks."

7C. Red-team & security frameworks

Tool | Type | Strength
PyRIT | Microsoft Python framework | Build custom red-team campaigns
Garak | NVIDIA CLI scanner | 100+ probes; broad baseline
AgentDojo | Academic benchmark | Agent prompt-injection robustness
Promptfoo (red-team mode) | YAML + CLI | OWASP LLM Top 10 coverage
DeepEval Red Teamer | Python | 50+ vulnerability categories in code
Lakera Red | Commercial | Hosted, runtime + offline
Mindgard | Commercial | Continuous adversarial testing SaaS

See red-blue-purple-team-ai-faq.md for theory; commercial-llm-mcp-testing-tools.md for the full vendor landscape.


8. Threshold-Setting Matrix

The "what number do I put in CI" question. There's no universal answer — but there's a defensible method.

Approach | How It Works | When to Use
Baseline + safety margin | Measure current model + prompt on golden set; threshold = baseline minus tolerance | Most common; honest
Reference-best minus delta | Threshold relative to known-best (GPT-4 etc.) | When you're optimising for parity
Business-derived | Backwards from product KPI (e.g. "we need 90% deflection to break even") | Strong product ties
Adversarial absolute | Safety / refusal tests: must pass 100% on critical categories | Safety gates
Distribution-based | Pass if score distribution overlap with baseline > X | When point estimates are noisy
Human-validated calibration | Threshold corresponds to "would a reviewer accept this" rate | High-stakes content

Anti-pattern: picking thresholds from intuition. "It should be > 0.8" — based on what? Always derive from data.
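One way to make "baseline minus tolerance" concrete is to rerun the golden set several times and set the tolerance from the observed run-to-run spread. The sketch below uses the baseline mean minus k sample standard deviations; the choice of k = 2 and the function name are assumptions for illustration, and a team may prefer a fixed tolerance agreed with stakeholders.

```python
import statistics


def derive_threshold(baseline_scores, k=2.0):
    """CI threshold = baseline mean minus k sample standard deviations."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline runs to estimate spread")
    baseline = statistics.mean(baseline_scores)
    margin = k * statistics.stdev(baseline_scores)
    return baseline - margin


# Hypothetical faithfulness scores from three reruns of the same golden set.
threshold = derive_threshold([0.88, 0.90, 0.92])
```

With these three runs the mean is 0.90 and the sample standard deviation is 0.02, so the derived threshold is 0.86. The number now has a provenance that can be shown at audit: which runs, which dataset version, which margin.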


9. Tiering Matrix — Findings & Gates

Critical for Lead / senior roles. Without tiers, every issue is "P1" and nothing moves.

Tier | Definition | Action | Examples
Critical | Safety, PII leakage, authority escalation, regulator-relevant policy violation | Blocks release | Jailbreak success, PII in output, prompt injection bypassing guardrail
High | Quality regression on a measured metric beyond tolerance | Waivable with explicit sign-off | Faithfulness drop > 5%, latency p95 > budget
Medium | Quality regression within tolerance; new edge case identified | Tracked; fix in next sprint | Marginal score drop, new niche failure mode
Low | Cosmetic / non-functional | Backlog | Phrasing preference, minor fluency drop

Interview line: "Tiers aren't a vibe — they're a policy. Published, signed off, defended. Without them everything is critical and nothing actually moves."
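The tier policy can be encoded as a release gate so it is applied the same way every time. A minimal sketch, assuming findings are recorded with a tier and a waiver flag (the `Finding` class and `release_decision` function are hypothetical names for this example): critical findings always block, high findings block unless explicitly signed off, medium and low never block.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    tier: str              # "critical", "high", "medium", or "low"
    waived: bool = False   # True once an explicit sign-off is recorded


def release_decision(findings):
    """Return (can_ship, blockers) under the tier policy above."""
    blockers = []
    for finding in findings:
        if finding.tier == "critical":
            blockers.append(finding.title)  # a waiver cannot unblock a critical
        elif finding.tier == "high" and not finding.waived:
            blockers.append(finding.title)
    return (not blockers, blockers)
```

For instance, a waived faithfulness drop plus a medium edge case ships, while the same set plus "PII in output" does not, regardless of any waiver attached to it.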


10. Rapid-Fire Q&A — Interview-Ready

Conceptual

  • Q: What's the single most important LLM metric? None. Every interesting evaluation stacks multiple signals. The honest answer if pressed: faithfulness for RAG, task completion for agents, refusal correctness for safety.

  • Q: What's wrong with BLEU and ROUGE for LLM eval? Paraphrase blindness. They reward surface overlap with a reference, which penalises legitimately rephrased correct answers. Use BERTScore or LLM-as-judge for semantic comparison.

  • Q: How do you evaluate without ground-truth labels? Reference-free metrics — faithfulness, answer relevance, coherence — using LLM-as-judge with a calibrated rubric, paired with deterministic checks (citations, schema, length).

  • Q: How do you stop LLM-as-judge from being a flaky test source? Temperature zero on the judge, a different model from the generator, chain-of-thought in the rubric, multi-judge consensus for high stakes, periodic human calibration, version-pin the judge.

RAG-specific

  • Q: What does Ragas measure that DeepEval doesn't, or vice versa? Mostly overlap. Ragas is dataset-batch ergonomic for RAG metric reporting. DeepEval is pytest-style and broader (safety, agents, red-team). Many teams use both.

  • Q: Faithfulness vs answer relevance? Faithfulness = the answer is grounded in retrieved context (didn't make things up). Answer relevance = the answer is about the question. They fail independently.

  • Q: Why measure retrieval and generation separately? Bad retrieval poisons even good generation. Good generation can still misuse retrieval. Localising regressions requires isolating each layer.

Agent-specific

  • Q: How do you test an agent end-to-end? Trace-level assertions on the tool-call sequence (which tools, what args, what order), plus task-completion oracle, plus latency/cost budgets. The trace matters as much as the final answer.

  • Q: What failure modes are unique to agentic systems? Tool misuse, cascading hallucination, latency amplification, cost amplification, authority creep, infinite reasoning loops, indirect injection through tool outputs.

  • Q: How do you handle non-deterministic agent traces? Assert on properties not exact paths. Required tools must be present; arguments must satisfy schema; outcome must match goal-state. The exact path may vary.
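The "assert on properties, not exact paths" answer can be sketched as a checker that ignores step order and extra steps. The trace shape used here (a list of dicts with "tool" and "args" keys), the type-based argument schema, and the function name are all assumptions for illustration; real tracing tools expose their own formats.

```python
def check_trace(trace, required_tools, arg_schemas, final_state, goal_state):
    """Return a list of violations; an empty list means the trace passes."""
    violations = []
    called = [step["tool"] for step in trace]

    # Property 1: every required tool was called at least once, in any order.
    for tool in required_tools:
        if tool not in called:
            violations.append(f"missing required tool: {tool}")

    # Property 2: arguments of each call satisfy that tool's schema.
    for step in trace:
        schema = arg_schemas.get(step["tool"], {})
        for name, expected_type in schema.items():
            if name not in step["args"]:
                violations.append(f"{step['tool']}: missing argument {name}")
            elif not isinstance(step["args"][name], expected_type):
                violations.append(f"{step['tool']}: {name} has wrong type")

    # Property 3: the final state matches the goal on every goal key.
    for key, expected in goal_state.items():
        if final_state.get(key) != expected:
            violations.append(f"goal mismatch on {key}")
    return violations
```

Two runs that reach the goal through different paths, one of them with an extra policy lookup, both return an empty list, which is what keeps the test stable across non-deterministic runs.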

Process

  • Q: How do you decide when an AI feature is "ready to ship"? Threshold-met on the eval suite, safety gates passed at 100%, performance and cost within budget, a documented residual-risk waiver if any, and a rollback plan if metrics drift in production.

  • Q: How do you handle a model upgrade (e.g. GPT-4 → GPT-5)? Full eval suite rerun on the new model with the old prompts; investigate any metric delta beyond tolerance; A/B test in production on a percentage of traffic; only promote if business KPIs hold.

  • Q: How do you build an eval golden set? Start from real user queries; add edge cases (ambiguous, multi-hop, no-answer-available, adversarial); aim for coverage of failure modes, not quantity; version the dataset alongside the model.


11. Common Anti-Patterns

Anti-pattern | Why It's Bad | Fix
Single-metric eval ("faithfulness is 0.85, ship it") | One number masks failure modes | Stack 3+ signals; report per-category
Same model as judge and generator | Inflated scores | Use a different (stronger) judge
No human calibration of LLM-as-judge | Unknown reliability | Validate on a labelled sample
Thresholds picked from intuition | Indefensible at audit | Derive from baseline + safety margin
Eval suite that never fails | Suspect; likely undersized | Adversarial seeding; coverage map
Pass/fail only, no trend over time | Misses drift | Plot metrics; alert on deltas
Tests written after the prompt | Tests pass because they encode current behaviour | Write expected behaviour first
Mixing eval and observability | Confused signals | Eval = pre-release; observability = production; separate stores
Trusting one judge model | Single-point bias | Multi-judge or human spot checks

12. Cross-References

  • RAG metric deep dive: ragas-faq.md
  • pytest-style framework: deepeval-faq.md
  • Adversarial / red-team theory: red-blue-purple-team-ai-faq.md
  • MCP-specific testing: mcp-servers-faq.md, mcp-testing-roadmap.md
  • Architecture context: rag-vs-agents-vs-agentic-rag.md
  • Platform-specific testing: enterprise-llm-platforms.md
  • Tool landscape: commercial-llm-mcp-testing-tools.md
  • Lifecycle (pre-prod → prod): llm-testing-lifecycle.md

13. Master Interview Sound-Bites

  • "Evaluation has three layers — model, system, product. Most failures live at the system layer; most metrics live at the model layer. Bridging that gap is where QE adds value."
  • "No single metric is enough. I stack at least three signals — LLM-as-judge for nuance, deterministic checks for what can be checked, and citation validity for grounded outputs."
  • "Thresholds aren't intuition — they're derived. Baseline plus safety margin; safety categories at 100%. Anything else is indefensible at audit."
  • "I tier findings explicitly — critical blocks release, high is waivable with sign-off, medium is tracked. Without tiers everything is P1 and nothing moves."
  • "For agents the trace matters as much as the answer. A right answer reached through the wrong tools is still a quality defect — cost, latency, audit."
  • "LLM-as-judge is a system I test, not a tool I trust. Strong judge, different model from the generator, validated against humans, multi-judge for high stakes."