DeepEval — FAQ & Quick Reference¶

DeepEval is an open-source Python framework for LLM and RAG evaluation, designed to feel like pytest. Maintained by Confident AI. Library: pip install deepeval.

Fundamentals¶

Q: What is DeepEval? A pytest-style evaluation framework for LLM applications. Treats LLM tests like unit tests — you write assert_test() calls with metrics and thresholds, and they pass or fail in CI. Covers RAG metrics, behavioural/safety, agent evaluation, and red-teaming. Has a commercial cloud companion (Confident AI / DeepEval Cloud) for tracking runs and datasets.

Q: How does DeepEval differ from Ragas? | | DeepEval | Ragas | |---|---|---| | Style | pytest-native | Dataset-batch | | Strength | Single-test assertions, CI-first ergonomics, red-teaming | Aggregate RAG-metric reporting | | Metric set | Broader (G-Eval, hallucination, bias, toxicity, agents, conversational) | RAG-focused with growing agent support | | Commercial layer | Confident AI (paid cloud) | Ragas App |

Most teams I'd recommend using DeepEval for behavioural/safety/agent tests and Ragas for aggregate RAG metrics — they overlap but each has the cleaner API for its strength.

Core Metrics¶

Metric	Category	What It Measures
G-Eval	Custom	LLM-as-judge against a user-defined evaluation criterion (most flexible)
Hallucination	RAG/Gen	Are claims supported by provided context?
Faithfulness	RAG	Same idea, RAG-specific
Answer Relevancy	Gen	Does the answer address the question?
Contextual Precision / Recall / Relevancy	Retrieval	Three retrieval-quality dimensions
Bias	Safety	Demographic/political/ideological bias in the output
Toxicity	Safety	Harmful, offensive, or unsafe content
Summarization	Specialised	Alignment + inclusion against a source text
Tool Correctness	Agentic	Did the agent call the right tools?
Task Completion	Agentic	Did the agent achieve the goal?
Conversational Relevancy / Completeness / Knowledge Retention	Chat	Multi-turn conversation metrics
Red-Teaming Vulnerabilities	Safety	50+ attack categories — prompt injection, PII leakage, jailbreaks, etc.

Implementation¶

Q: What's the minimal setup? ```python from deepeval import assert_test from deepeval.test_case import LLMTestCase from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric

def test_rag_answer(): test_case = LLMTestCase( input="What's our refund policy?", actual_output=llm_response, retrieval_context=retrieved_chunks, ) assert_test(test_case, [ HallucinationMetric(threshold=0.3), # lower = better AnswerRelevancyMetric(threshold=0.7), # higher = better ]) `` Then run with:deepeval test run test_rag.py(or plainpytest`).

Q: What's G-Eval? The most powerful metric — define an evaluation criterion in plain English and DeepEval builds an LLM-judge chain-of-thought around it. Use when no built-in metric fits.

```python from deepeval.metrics import GEval from deepeval.test_case import LLMTestCaseParams

professionalism = GEval( name="Professionalism", criteria="The answer is polite, neutral, and avoids slang.", evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT], threshold=0.7, ) ```

Q: How do I configure the judge model? Any LiteLLM-supported model. Default is OpenAI; configure via env var or model= parameter on each metric. Local Ollama models work out of the box.

python HallucinationMetric(threshold=0.3, model="ollama/llama3.1:8b")

Red-Teaming (the killer feature)¶

Q: What does DeepEval's red-teaming module do? It generates adversarial inputs across configurable vulnerability categories (prompt injection, PII, jailbreak, bias, toxicity, illegal activity, etc.) and runs them against your LLM application, then scores responses for compliance.

```python from deepeval.red_teaming import RedTeamer from deepeval.red_teaming.types import Vulnerability, AttackEnhancement

red_teamer = RedTeamer( target_purpose="Answer customer-service questions for a bank", target_system_prompt=system_prompt, ) results = red_teamer.scan( target_model_callback=my_llm_callback, attacks_per_vulnerability=10, vulnerabilities=[Vulnerability.PROMPT_INJECTION, Vulnerability.PII_LEAKAGE], attack_enhancements={AttackEnhancement.JAILBREAK_LINEAR: 0.5}, ) ```

This is the easiest off-the-shelf way to get a behavioural / safety test suite started.

Agentic & Conversational¶

Q: How does DeepEval test agents? Two main metrics: ToolCorrectnessMetric (did the agent call the expected tools with the expected arguments?) and TaskCompletionMetric (did it achieve the goal?). For multi-step traces, you assert on a trace object that captures the full tool-call sequence.

Q: How does it test multi-turn chat? ConversationalTestCase carries a list of turns. Metrics include ConversationalRelevancyMetric, ConversationalCompletenessMetric, and KnowledgeRetentionMetric (does the bot remember what the user said three turns ago).

Operational¶

Q: How do I integrate with pytest and CI? DeepEval is pytest under the hood — deepeval test run is a pytest wrapper. Just run pytest. Mark expensive evals with @pytest.mark.slow. Use assert_test to fail builds on threshold misses.

Q: How do I track results over time? Two options: (1) log to Confident AI (paid cloud) for dashboards and dataset versioning, or (2) export results to JSONL and build your own trend dashboards. The free path works fine for most teams.

Q: Synthetic dataset generation? DeepEval has a Synthesizer that generates eval datasets from your documents — pairs of (question, expected answer, expected context). Useful to bootstrap a golden set quickly. Always human-review samples — synthetic-only is risky.

python from deepeval.synthesizer import Synthesizer synth = Synthesizer() dataset = synth.generate_goldens_from_docs(document_paths=["policy.pdf"], max_goldens_per_context=2)

Gotchas¶

Threshold direction — hallucination/bias/toxicity: lower is better. Relevancy/faithfulness: higher is better. Easy to get wrong; check the metric docstring.
LLM-judge cost — every metric is one or more LLM calls per test case. Use cheap judges in CI, stronger judges for release gates.
Determinism — same caveats as Ragas. Run multiple samples for high-stakes assertions.
Version pinning — metric internals shift across versions; pin in CI.

Useful Links¶

Docs: https://deepeval.com/docs
GitHub: https://github.com/confident-ai/deepeval
Confident AI (commercial dashboard): https://confident-ai.com

Interview Sound-Bites¶

"DeepEval's strength is the pytest ergonomics — every eval is a normal test, normal CI, normal reporting. That makes adoption easy for an existing QA team."
"The red-teaming module is the fastest way to bootstrap a behavioural test suite — 50+ vulnerability categories out of the box, then you tune to your domain."
"For agents, I use ToolCorrectnessMetric for trace-level assertions and TaskCompletionMetric for end-to-end — together they catch both 'right answer, wrong path' and 'wrong answer' regressions."