LLM & Agent Evaluation Matrix

The master reference for evaluating LLMs, RAG systems, and AI agents. Built as matrices — easy to scan, easy to deliver in an interview. Use this when you need to pick a metric, justify a metric, or explain why a metric isn't enough on its own.

1. The Evaluation Landscape: Three Layers

Most candidates conflate these three. Separating them is a fast signal of depth.

Layer	What's Being Evaluated	Typical Metrics
Model evaluation	The model itself, in isolation	Benchmarks (MMLU, HumanEval, GSM8K, ARC), perplexity, calibration
System evaluation	The model + prompt + retrieval + tools + guardrails as a whole	Faithfulness, answer relevance, hallucination rate, refusal rate, citation accuracy
Product / outcome evaluation	Did the system help the user achieve their goal?	Task success rate, time-to-resolution, deflection rate, CSAT, retention

Interview line: "Most failures live at the system layer. Model-level metrics tell you about a model in a vacuum; product-level metrics tell you only after the fact. System-level metrics (faithfulness, hallucination rate, tool correctness) are the ones that catch issues before they become incidents."

2. The Metric Universe: Organised by Purpose

2A. Reference-Based Metrics

Need a labelled "ground truth" output to compare against.

Metric	What It Measures	Scale	When to Use	Gotcha
Exact Match	Literal string equality	0/1	Deterministic tasks (calculator, structured extraction)	Useless for natural-language outputs
F1 (token overlap)	Precision/recall on shared tokens	0–1	Q&A short answers	Misses paraphrasing
BLEU	n-gram overlap with reference, precision-weighted	0–1	Machine translation; legacy	Penalises legitimate paraphrasing
ROUGE (1 / 2 / L)	n-gram and longest-common-subsequence recall vs reference	0–1	Summarisation	Same paraphrase blindness
METEOR	Tokens + synonyms + stemming alignment	0–1	MT, summarisation; less brittle than BLEU	Slower; English-biased
BERTScore	Cosine similarity of contextual embeddings	0–1	Semantic similarity beyond surface form	Embeddings inherit model bias
Semantic similarity	Sentence-embedding cosine	0–1	Open-ended Q&A	Threshold setting is arbitrary
Edit distance (Levenshtein)	Min character edits	int	Code, structured outputs	Character-level only
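
The paraphrase gotcha in the Exact Match and F1 rows is easiest to see on a concrete pair. A minimal sketch, assuming a plain lower-case whitespace tokeniser; the helper names and example sentences are illustrative, not taken from any particular library:

```python
from collections import Counter


def normalise(text: str) -> list[str]:
    """Lower-case and split on whitespace (a deliberately simple tokeniser)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalised token sequences are identical, else 0."""
    return int(normalise(prediction) == normalise(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalise(prediction), normalise(reference)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the meeting was moved to friday"
verbatim = "The meeting was moved to Friday"
paraphrase = "they rescheduled it for the end of the week"

for candidate in (verbatim, paraphrase):
    print(candidate, exact_match(candidate, reference), token_f1(candidate, reference))
```

The paraphrase is a correct answer yet shares only one token with the reference, which is why these metrics suit short, deterministic outputs rather than free text.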

2B. Reference-Free Metrics

No ground truth needed — judge the output against a criterion.

Metric	What It Measures	Typical Implementation	Gotcha
Faithfulness	Are claims in the answer supported by the source context?	LLM-as-judge per claim	Judge bias; needs strong judge
Answer Relevance	Does the answer actually address the question?	LLM-as-judge or QA back-translation	Misses subtle off-topic responses
Groundedness	Output stays grounded in retrieved/provided context	Same as faithfulness; Azure ships a managed one	Same gotchas
Hallucination score	Complement of faithfulness: the share of claims not supported by the context	LLM-as-judge + entity overlap + citation check	No single signal is enough
Coherence	Logical / internal consistency of the answer	LLM-as-judge	Subjective; calibrate against human labels
Fluency	Grammar, readability	LLM-as-judge or classifier	Frontier models max this out — low signal
Toxicity	Harmful / offensive content	Detoxify, Perspective API, classifier	Cultural and contextual blind spots
Bias	Demographic / ideological skew	Counterfactual templates, StereoSet, BBQ	Coverage limited to tested attributes
Refusal correctness	System refuses when it should (and doesn't when it shouldn't)	Adversarial corpus + LLM-as-judge	Over-refusal is the silent killer
PII leakage	Output exposes personal data it shouldn't	Regex + NER + LLM-as-judge	Indirect references slip past regex
Calibration	Confidence aligns with actual accuracy	Brier score, ECE	Needs labelled outcomes; long-running
Citation validity	Citations exist and support the claim	Programmatic + judge	Faithfulness alone will not catch a false citation: if the context contains it, the answer still scores as faithful
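
For the Calibration row, both named scores can be computed from (confidence, outcome) pairs once labelled outcomes exist. A minimal sketch, assuming binary outcomes and equal-width confidence bins; the bin count of 10 is a common default, not something this document prescribes:

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


def expected_calibration_error(
    confidences: list[float], outcomes: list[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((c, o))
    total = len(outcomes)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - accuracy)
    return ece


confidences = [0.9] * 10
well_calibrated = [1] * 9 + [0]      # 90% confident, 90% correct
overconfident = [1] * 5 + [0] * 5    # 90% confident, 50% correct

print(expected_calibration_error(confidences, well_calibrated))
print(expected_calibration_error(confidences, overconfident))
```

Lower is better for both scores. ECE isolates the confidence/accuracy gap, while the Brier score also penalises answers that are simply wrong.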

2C. Retrieval Metrics (RAG-specific)

Metric	What It Measures	Range	Typical Threshold
Context Precision	Are the most-relevant chunks ranked highest?	0–1	> 0.7
Context Recall	Did retrieval pull all needed chunks?	0–1	> 0.8
Context Relevance	Are retrieved chunks actually relevant?	0–1	> 0.7
MRR (Mean Reciprocal Rank)	Reciprocal rank of the first relevant result, averaged over queries	0–1	> 0.7
nDCG	Ranking-aware quality of retrieval	0–1	> 0.8
Hit Rate @ k	Was a relevant chunk in top-k?	0/1 per query	> 0.9 @ k=5
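
The rank-based rows (MRR, nDCG, Hit Rate @ k) need only a ranked list of chunk IDs and a set of relevant IDs per query. A minimal sketch with binary relevance; the query data is made up for illustration:

```python
import math


def reciprocal_rank(ranked_ids: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant result, or 0 if none was retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def hit_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> int:
    """1 if any relevant result appears in the top k, else 0."""
    return int(any(doc_id in relevant for doc_id in ranked_ids[:k]))


def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# (ranked retrieval result, relevant chunk IDs) per query
queries = [
    (["a", "b", "c"], {"a"}),  # relevant chunk ranked first
    (["x", "y", "z"], {"y"}),  # relevant chunk ranked second
    (["m", "n", "o"], {"q"}),  # relevant chunk not retrieved
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
hit_rate = sum(hit_at_k(r, rel, k=3) for r, rel in queries) / len(queries)
print(mrr, hit_rate)
```

The per-query values are averaged over the golden set before being compared with the thresholds in the table.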

2D. Agent-Specific Metrics

Metric	What It Measures	Implementation
Tool Correctness	Right tool selected?	Compare trace to expected tool set
Argument Correctness	Right arguments passed?	Schema + value assertions
Plan Quality	Are the steps sensible and minimal?	LLM-as-judge or human review
Task Completion	Did the agent finish the task?	Goal-state oracle; LLM-as-judge
Trajectory Efficiency	Steps / tokens / cost vs minimum needed	Trace metrics vs golden trajectory
Recovery Rate	When a tool fails, does the agent recover?	Inject failures, measure outcomes
Goal-state Accuracy	Did the final state match the requested state?	State diff
Latency / Cost per task	Time and tokens per completed task	Trace-level capture
Self-consistency	Same task ⇒ same outcome across runs	Multi-sample agreement
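
Two of the rows above reduce to simple ratios once traces are captured. A minimal sketch, assuming each run is summarised by an outcome label and a step count; defining efficiency as golden steps over actual steps is one reasonable convention, not a standard:

```python
from collections import Counter


def self_consistency(outcomes: list[str]) -> float:
    """Share of runs that agree with the most common outcome."""
    if not outcomes:
        return 0.0
    _, count = Counter(outcomes).most_common(1)[0]
    return count / len(outcomes)


def trajectory_efficiency(actual_steps: int, golden_steps: int) -> float:
    """Golden step count over actual step count, capped at 1.0 (1.0 = minimal path)."""
    if actual_steps <= 0:
        raise ValueError("actual_steps must be positive")
    return min(1.0, golden_steps / actual_steps)


# Five runs of the same task (hypothetical outcome labels).
runs = ["refund_issued"] * 4 + ["escalated"]
print(self_consistency(runs))
print(trajectory_efficiency(actual_steps=6, golden_steps=3))
```

The same ratio works for tokens or cost by swapping the step counts for the corresponding trace totals.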

2E. Operational / Production Metrics

Metric	What It Measures	Why It Matters
Latency p50 / p95 / p99	Response time distribution	UX; SLA
Token cost per request	Input + output tokens × price	Budget; ROI
Throughput / QPS	Requests served per second at acceptable latency	Scale planning
Error rate	API / tool / guardrail failures per minute	Reliability
Refusal rate	% of requests refused	Over-refusal detection
User feedback signals	Thumbs-up/down, regenerate clicks, conversation length	Real-world signal
Drift score	Distribution shift in inputs or outputs over time	Early warning
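
Latency percentiles and token cost are straightforward to derive from trace data. A minimal sketch using the nearest-rank percentile definition; the per-1k-token prices are placeholders, not real provider pricing:

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Input and output tokens are usually priced separately."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


latencies_ms = list(range(1, 101))  # synthetic sample: 1 ms .. 100 ms
print({p: percentile(latencies_ms, p) for p in (50, 95, 99)})
print(request_cost(1200, 300, input_price_per_1k=0.01, output_price_per_1k=0.03))
```

Reporting p95 and p99 alongside p50 matters because agent workflows tend to have long tails that an average hides.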

3. The Decision Matrix: Picking the Right Metric

When asked "how would you evaluate X?", your structured answer comes from this matrix.

3A. By task type

Task Type	Primary Metrics	Secondary / Supporting
Open-ended Q&A	Answer relevance, faithfulness, hallucination	Coherence, citation validity
RAG-based Q&A	All of the above + context precision/recall	Latency, cost
Summarisation	Faithfulness, coverage, conciseness	ROUGE-L for legacy comparison
Translation	BERTScore, COMET (modern)	BLEU/METEOR (legacy)
Code generation	Functional correctness (run tests), pass@k	Compile rate, complexity
Structured extraction	Field-level exact match, schema conformance	F1 per field, hallucination
Classification	Accuracy, macro-F1, calibration	Confusion matrix per class
Chat / multi-turn	Conversational relevance, knowledge retention, completeness	Latency, refusal correctness
Agent / tool use	Tool correctness, task completion	Trajectory efficiency, recovery rate
Safety / refusal	Refusal correctness, over-refusal rate	Bias, toxicity, jailbreak rate
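
For the code-generation row, pass@k is usually estimated from n sampled solutions per problem, of which c pass the tests, as 1 - C(n-c, k) / C(n, k). A minimal sketch of that estimator; the per-problem values are then averaged over the benchmark:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct, given c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failing samples to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples for one problem, 3 of which pass the unit tests.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k=1 the estimate reduces to the plain pass rate c / n; larger k rewards a model that gets there within a few attempts.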

3B. By evaluation method

Method	When to Use	Strengths	Weaknesses
Deterministic check (exact match, schema, regex)	Output has structure	Fast, free, reliable	Misses paraphrase
Lexical metric (BLEU/ROUGE)	Legacy comparison; cheap baseline	Cheap, well-known	Paraphrase blindness
Embedding similarity	Semantic comparison, no labels	Cheap, captures paraphrase	Threshold is arbitrary
LLM-as-judge	Subjective qualities, no ground truth	Captures nuance	Cost, bias, non-determinism
Multi-judge consensus	High-stakes assessment	Reduces single-judge bias	Higher cost
Human evaluation	Final calibration, edge cases	Gold standard	Slow, expensive, doesn't scale
Programmatic check (citations exist, length budget)	Behavioural assertions	Cheap, reliable	Limited coverage
A/B test in production	Real-world signal	Captures user impact	Slow; needs traffic; product-level only

3C. By stage of development

Stage	Primary Eval	Tooling
Pre-development / spec	Define metrics + thresholds before building	Spreadsheet / Notion
Development	Unit-style eval on golden datasets	DeepEval / pytest
Integration	End-to-end on representative inputs	Ragas / DeepEval / custom
Pre-release	Full eval suite + adversarial / red-team	DeepEval Red Teamer, PyRIT, Garak
Post-release	Monitoring + drift + user feedback	Langfuse / LangSmith / Arize
Continuous improvement	A/B testing, fine-tune iteration	LangSmith / Braintrust / custom

4. Quality Attribute Matrix: Mapping ISO/IEC 25010 to AI Systems

Auditors love this framing. Most AI candidates don't have it.

Attribute	Traditional Software	AI System Equivalent
Functional suitability	Feature works to spec	Accuracy, completeness, appropriateness — measured by eval suite
Reliability	MTBF, fault tolerance	Consistency across runs, refusal correctness, recovery from tool failure
Performance efficiency	Latency, throughput	Latency p50/p95/p99, tokens-per-request, cost-per-task
Compatibility	Co-existence, interoperability	Model/API version compatibility, MCP conformance
Usability	UX, accessibility	Coherence, fluency, helpfulness, refusal phrasing
Security	Confidentiality, integrity, authenticity	Prompt-injection resilience, PII leakage, auth boundary enforcement
Maintainability	Modularity, modifiability	Prompt versioning, eval reproducibility, regression coverage
Portability	Adaptability, installability	Provider abstraction (LiteLLM), model-swap resilience

Interview line: "I map AI quality work onto the same ISO 25010 attributes auditors use for traditional software — it gives the audit team a frame they recognise and ensures we cover security and reliability, not just functional correctness."

5. Risk → Test-Category Matrix

The most useful matrix for designing a test programme. Maps known failure classes to coverage.

Risk	Test Category	Example Assertion
Hallucination	Faithfulness eval, citation validity	Faithfulness > 0.85 on golden set
Stale knowledge	RAG retrieval refresh; date-bound queries	Answers for post-cutoff dates use retrieval
Prompt injection (direct)	Adversarial input corpus	System ignores `IGNORE PREVIOUS` patterns
Prompt injection (indirect)	Poisoned-document corpus in RAG	Embedded instructions in retrieved docs not executed
Jailbreak	Categorised jailbreak corpus	Model refuses across role-play / hypothetical / encoded variants
Bias	Counterfactual templates per protected attribute	Output behaviour invariant across attribute
Toxicity	Adversarial generations targeting harm	Toxicity classifier < threshold
PII leakage	Probes attempting data extraction	No PII in output to unauthorised user
Authority escalation	Agent permission tests	Agent refuses tool requiring elevated scope
Tool misuse	Trace-level assertions on tool selection	Right tool, right args, right order
Cascading hallucination	Multi-step trace eval	Step N output supported by step N-1 result
Latency amplification	Trace latency budget per workflow	p95 < N seconds end-to-end
Cost amplification	Token budget per workflow	Tokens per task < N
Drift	Continuous golden-set rerun	Metric delta vs baseline < tolerance
Refusal regression	Benign-query refusal rate	Over-refusal < 2% on benign queries
Schema drift	Output schema conformance	Pydantic validation passes 100%
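
The schema-drift row is the cheapest of these assertions to automate. The table names Pydantic; the sketch below uses only the standard library as a stand-in so it runs anywhere, and the invoice fields are a hypothetical schema:

```python
import json

# Hypothetical output contract: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_id": str, "total": (int, float), "currency": str}


def conforms(raw_output: str) -> bool:
    """True if the output parses as a JSON object with every required field typed correctly."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(
        isinstance(payload.get(field), expected)
        for field, expected in REQUIRED_FIELDS.items()
    )


def conformance_rate(outputs: list[str]) -> float:
    """Share of model outputs that satisfy the contract."""
    return sum(conforms(o) for o in outputs) / len(outputs)


outputs = [
    '{"invoice_id": "A1", "total": 12.5, "currency": "EUR"}',  # valid
    '{"invoice_id": "A2", "total": "12.5"}',                   # wrong type, missing field
    "Sure! Here is the invoice you asked for.",                # not JSON at all
]
print(conformance_rate(outputs))
```

In CI the assertion would be `conformance_rate(outputs) == 1.0`, matching the 100% target in the table.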

6. LLM-as-Judge: The Critical Sub-Skill

LLM-as-judge appears in almost every metric above. Treat it as a system you test, not a tool you trust.

Judge calibration checklist

Step	Why It Matters
Pick a strong judge (≥ GPT-4 class)	Weak judges introduce noise indistinguishable from real signal
Use a different model than the generator	Same-model judging inflates scores
Validate the judge against human labels on a sample	Without this you don't know how reliable the score is
Calibrate the rubric — explicit criteria, not vibes	Vague rubrics = unstable scores
Pin temperature = 0 for the judge	Reduces run-to-run noise
Use chain-of-thought before the verdict	Improves judge accuracy on subtle cases
Multi-judge consensus for high-stakes assessment	Three judges + majority vote beats one
Re-validate when models update	Judges drift too
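
Two checklist steps, validating against human labels and multi-judge consensus, can be sketched without calling any model. Cohen's kappa is used here as the agreement statistic; that is an assumption, since the checklist does not name one. The verdict labels are illustrative:

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement between judge and human labels, corrected for chance agreement."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_freq[label] * human_freq[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def majority_vote(verdicts: list[str]) -> str:
    """Most common verdict across judges (use an odd number of judges to avoid ties)."""
    return Counter(verdicts).most_common(1)[0][0]


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(judge_labels, human_labels))
print(majority_vote(["pass", "fail", "pass"]))
```

Raw agreement overstates reliability when one label dominates, which is why a chance-corrected statistic is worth the extra few lines.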

Judge anti-patterns to call out in interview

Anti-pattern	Why It's Bad
Single judge, same model as generator	Inflated scores, no second opinion
Boolean verdict ("good / bad") with no reasoning	Lost signal; hard to debug
No human validation sample	Operating blind
Judge prompt mixed with generation prompt	Coupled changes, hard to debug regressions
Trusting the score without re-checking it on a labelled subset	Confidence without calibration

7. Framework Comparison Matrix

When asked "which framework would you use?", this is the structured answer.

7A. Open-source evaluation frameworks

Framework	Style	Strength	Weakness	Best Fit
Ragas	Dataset-batch	RAG-metric coverage	Less ergonomic for behavioural tests	RAG-heavy projects
DeepEval	pytest-native	Broad — RAG, safety, agents, red-team	Cost adds up at scale	Teams wanting CI-integrated AI tests
TruLens	Feedback functions	Programmable evaluators	Smaller ecosystem	Custom metric needs
Promptfoo	YAML-config CLI	Easy CI integration + red-team	Less Python flexibility	Polyglot teams
OpenAI Evals	Eval registry	Standard format; community evals	OpenAI-leaning	Pre-existing OpenAI shop
LangChain Evaluators	Chain-native	Tight LangChain fit	Lock-in	LangChain projects
Arize Phoenix Evals	Template-driven	Pairs with tracing	Less standalone	Already-on-Arize teams

7B. Observability / tracing platforms

Platform	Open Source?	Tracing	Eval	Dataset Mgmt	Self-Host
Langfuse	Yes	✓✓	✓	✓	✓
LangSmith	No (cloud)	✓✓	✓	✓	✗
Arize Phoenix / AX	Phoenix yes; AX no	✓✓	✓	✓	Phoenix only
Braintrust	No	✓	✓✓	✓✓	✗
Helicone	Yes	✓	basic	✗	✓
W&B Weave	No	✓	✓	✓	✗
OpenTelemetry GenAI	Yes (spec)	✓✓	via integration	via integration	✓

Interview line: "I'd anchor on an open-source primitive — Ragas or DeepEval for metrics, Langfuse for tracing — because eval logic belongs in version control. Commercial layers earn their place when dashboards, dataset versioning, or non-engineer collaboration become bottlenecks."

7C. Red-team & security frameworks

Tool	Type	Strength
PyRIT	Microsoft Python framework	Build custom red-team campaigns
Garak	NVIDIA CLI scanner	100+ probes; broad baseline
AgentDojo	Academic benchmark	Agent prompt-injection robustness
Promptfoo (red-team mode)	YAML + CLI	OWASP LLM Top 10 coverage
DeepEval Red Teamer	Python	50+ vulnerability categories in code
Lakera Red	Commercial	Hosted, runtime + offline
Mindgard	Commercial	Continuous adversarial testing SaaS

See red-blue-purple-team-ai-faq.md for theory; commercial-llm-mcp-testing-tools.md for the full vendor landscape.

8. Threshold-Setting Matrix

The "what number do I put in CI" question. There's no universal answer — but there's a defensible method.

Approach	How It Works	When to Use
Baseline + safety margin	Measure current model + prompt on golden set; threshold = baseline minus tolerance	Most common; honest
Reference-best minus delta	Threshold relative to known-best (GPT-4 etc.)	When you're optimising for parity
Business-derived	Backwards from product KPI (e.g. "we need 90% deflection to break even")	Strong product ties
Adversarial absolute	Safety / refusal tests: must pass 100% on critical categories	Safety gates
Distribution-based	Pass if the score distribution's overlap with the baseline exceeds X	When point estimates are noisy
Human-validated calibration	Threshold corresponds to "would a reviewer accept this" rate	High-stakes content

Anti-pattern: picking thresholds from intuition. "It should be > 0.8" — based on what? Always derive from data.
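
The "baseline + safety margin" approach can be made concrete by rerunning the golden set several times and setting the tolerance from the observed run-to-run noise. A minimal sketch; the two-standard-deviation margin and the scores are illustrative assumptions:

```python
from statistics import mean, stdev


def derive_threshold(baseline_scores: list[float], margin_in_std: float = 2.0) -> float:
    """Threshold = baseline mean minus a tolerance sized from run-to-run noise."""
    return mean(baseline_scores) - margin_in_std * stdev(baseline_scores)


# Faithfulness on the golden set across five repeated runs (made-up numbers).
baseline_runs = [0.88, 0.90, 0.86, 0.89, 0.87]
threshold = derive_threshold(baseline_runs)
print(round(threshold, 3))
```

The resulting number answers "based on what?": it is the current system's measured performance, less a margin that keeps ordinary noise from failing the build.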

9. Tiering Matrix: Findings & Gates

Critical for Lead / senior roles. Without tiers, every issue is "P1" and nothing moves.

Tier	Definition	Action	Examples
Critical	Safety, PII leakage, authority escalation, regulator-relevant policy violation	Blocks release	Jailbreak success, PII in output, prompt injection bypassing guardrail
High	Quality regression on a measured metric beyond tolerance	Waivable with explicit sign-off	Faithfulness drop > 5%, latency p95 > budget
Medium	Quality regression within tolerance; new edge case identified	Tracked; fix in next sprint	Marginal score drop, new niche failure mode
Low	Cosmetic / non-functional	Backlog	Phrasing preference, minor fluency drop

Interview line: "Tiers aren't a vibe — they're a policy. Published, signed off, defended. Without them everything is critical and nothing actually moves."
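
Treating the tiers as policy means they can be encoded as a release gate. A minimal sketch of the table above, assuming findings are recorded with a tier and an optional waiver flag; the class and field names are illustrative:

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Finding:
    title: str
    tier: Tier
    waived: bool = False  # explicit sign-off recorded; only meaningful for HIGH


def blocking_findings(findings: list[Finding]) -> list[Finding]:
    """Critical always blocks; High blocks unless waived; Medium and Low never block."""
    return [
        f
        for f in findings
        if f.tier is Tier.CRITICAL or (f.tier is Tier.HIGH and not f.waived)
    ]


findings = [
    Finding("PII in output", Tier.CRITICAL, waived=True),  # waiver is ignored for Critical
    Finding("Faithfulness drop > 5%", Tier.HIGH, waived=True),
    Finding("Latency p95 over budget", Tier.HIGH),
    Finding("Minor fluency drop", Tier.LOW),
]
print([f.title for f in blocking_findings(findings)])
```

Keeping the rule in code makes the gate auditable: a release is blocked exactly when this list is non-empty.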

10. Rapid-Fire Q&A: Interview-Ready

Conceptual

Q: What's the single most important LLM metric? None. Every interesting evaluation stacks multiple signals. The honest answer if pressed: faithfulness for RAG, task completion for agents, refusal correctness for safety.
Q: What's wrong with BLEU and ROUGE for LLM eval? Paraphrase blindness. They reward surface overlap with a reference, which penalises legitimately rephrased correct answers. Use BERTScore or LLM-as-judge for semantic comparison.
Q: How do you evaluate without ground-truth labels? Reference-free metrics — faithfulness, answer relevance, coherence — using LLM-as-judge with a calibrated rubric, paired with deterministic checks (citations, schema, length).
Q: How do you stop LLM-as-judge from becoming a source of flaky tests? Temperature zero on the judge, a different model from the generator, chain-of-thought in the rubric, multi-judge consensus for high stakes, periodic human calibration, and a version-pinned judge.

RAG-specific

Q: What does Ragas measure that DeepEval doesn't, or vice versa? Mostly overlap. Ragas is dataset-batch ergonomic for RAG metric reporting. DeepEval is pytest-style and broader (safety, agents, red-team). Many teams use both.
Q: Faithfulness vs answer relevance? Faithfulness = the answer is grounded in retrieved context (didn't make things up). Answer relevance = the answer is about the question. They fail independently.
Q: Why measure retrieval and generation separately? Bad retrieval poisons even good generation. Good generation can still misuse retrieval. Localising regressions requires isolating each layer.

Agent-specific

Q: How do you test an agent end-to-end? Trace-level assertions on the tool-call sequence (which tools, what args, what order), plus task-completion oracle, plus latency/cost budgets. The trace matters as much as the final answer.
Q: What failure modes are unique to agentic systems? Tool misuse, cascading hallucination, latency amplification, cost amplification, authority creep, infinite reasoning loops, indirect injection through tool outputs.
Q: How do you handle non-deterministic agent traces? Assert on properties, not exact paths. Required tools must be present; arguments must satisfy schema; outcome must match goal-state. The exact path may vary.
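
Property-based trace assertions can be sketched as a single checker that ignores step order and extra steps. The trace format (a list of tool/args dicts) and the tool names are hypothetical:

```python
from typing import Any, Callable

Step = dict[str, Any]


def check_trace(
    trace: list[Step],
    required_tools: set[str],
    arg_checks: dict[str, Callable[[dict[str, Any]], bool]],
    final_state: dict[str, Any],
    goal_state: dict[str, Any],
) -> list[str]:
    """Return a list of property violations; an empty list means the trace passes."""
    failures = []
    called = {step["tool"] for step in trace}
    for tool in sorted(required_tools - called):
        failures.append(f"missing required tool: {tool}")
    for step in trace:
        check = arg_checks.get(step["tool"])
        if check is not None and not check(step["args"]):
            failures.append(f"bad arguments for {step['tool']}")
    if final_state != goal_state:
        failures.append("final state does not match goal state")
    return failures


arg_checks = {"issue_refund": lambda a: isinstance(a.get("amount"), float) and a["amount"] > 0}
goal = {"order": "A-17", "status": "refunded"}

direct = [
    {"tool": "lookup_order", "args": {"order_id": "A-17"}},
    {"tool": "issue_refund", "args": {"order_id": "A-17", "amount": 20.0}},
]
with_detour = [{"tool": "check_policy", "args": {}}] + direct  # extra step is acceptable

for trace in (direct, with_detour):
    print(check_trace(trace, {"lookup_order", "issue_refund"}, arg_checks, goal, goal))
```

Both traces pass because the assertions describe what must be true of any acceptable run, not which path the agent took.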

Process

Q: How do you decide when an AI feature is "ready to ship"? Thresholds met on the eval suite, safety gates passed at 100%, performance and cost within budget, a documented residual-risk waiver if any, and a rollback plan if metrics drift in production.
Q: How do you handle a model upgrade (e.g. GPT-4 → GPT-5)? Full eval suite rerun on the new model with the old prompts; investigate any metric delta beyond tolerance; A/B test in production on a percentage of traffic; only promote if business KPIs hold.
Q: How do you build an eval golden set? Start from real user queries; add edge cases (ambiguous, multi-hop, no-answer-available, adversarial); aim for coverage of failure modes, not quantity; version the dataset alongside the model.

11. Common Anti-Patterns

Anti-pattern	Why It's Bad	Fix
Single-metric eval ("faithfulness is 0.85, ship it")	One number masks failure modes	Stack 3+ signals; report per-category
Same model as judge and generator	Inflated scores	Use a different (stronger) judge
No human calibration of LLM-as-judge	Unknown reliability	Validate on a labelled sample
Thresholds picked from intuition	Indefensible at audit	Derive from baseline + safety margin
Eval suite that never fails	Suspect — likely undersized	Adversarial seeding; coverage map
Pass/fail only — no trend over time	Misses drift	Plot metrics; alert on deltas
Tests written after the prompt	Tests pass because they encode current behaviour	Write expected behaviour first
Mixing eval and observability	Confused signals	Eval = pre-release; observability = production; separate stores
Trusting one judge model	Single-point bias	Multi-judge or human spot checks
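
The fixes for single-metric eval and pass/fail-only reporting both come down to comparing several metrics against a stored baseline and alerting on the deltas. A minimal sketch, assuming higher is better for every metric; the metric names and numbers are illustrative:

```python
def metric_regressions(
    baseline: dict[str, float], current: dict[str, float], tolerance: float
) -> dict[str, float]:
    """Return {metric: delta} for every metric that dropped by more than the tolerance."""
    return {
        name: round(current[name] - baseline[name], 4)
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


baseline = {"faithfulness": 0.88, "answer_relevance": 0.91, "citation_validity": 0.95}
current = {"faithfulness": 0.81, "answer_relevance": 0.90, "citation_validity": 0.96}

print(metric_regressions(baseline, current, tolerance=0.05))
```

Storing the deltas from every run, not just the pass/fail result, is what makes a slow drift visible before it crosses the threshold.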

12. Cross-References

RAG metric deep dive → ragas-faq.md
pytest-style framework → deepeval-faq.md
Adversarial / red-team theory → red-blue-purple-team-ai-faq.md
MCP-specific testing → mcp-servers-faq.md, mcp-testing-roadmap.md
Architecture context → rag-vs-agents-vs-agentic-rag.md
Platform-specific testing → enterprise-llm-platforms.md
Tool landscape → commercial-llm-mcp-testing-tools.md
Lifecycle (pre-prod → prod) → llm-testing-lifecycle.md

13. Master Interview Sound-Bites

"Evaluation has three layers — model, system, product. Most failures live at the system layer; most metrics live at the model layer. Bridging that gap is where QE adds value."
"No single metric is enough. I stack at least three signals — LLM-as-judge for nuance, deterministic checks for what can be checked, and citation validity for grounded outputs."
"Thresholds aren't intuition — they're derived. Baseline plus safety margin; safety categories at 100%. Anything else is indefensible at audit."
"I tier findings explicitly — critical blocks release, high is waivable with sign-off, medium is tracked. Without tiers everything is P1 and nothing moves."
"For agents the trace matters as much as the answer. A right answer reached through the wrong tools is still a quality defect — cost, latency, audit."
"LLM-as-judge is a system I test, not a tool I trust. Strong judge, different model from the generator, validated against humans, multi-judge for high stakes."