DeepEval — FAQ & Quick Reference¶
DeepEval is an open-source Python framework for LLM and RAG evaluation, designed to feel like pytest. Maintained by Confident AI. Library: `pip install deepeval`.
Fundamentals¶
Q: What is DeepEval?
A pytest-style evaluation framework for LLM applications. Treats LLM tests like unit tests: you write `assert_test()` calls with metrics and thresholds, and they pass or fail in CI. Covers RAG metrics, behavioural/safety, agent evaluation, and red-teaming. Has a commercial cloud companion (Confident AI / DeepEval Cloud) for tracking runs and datasets.
Q: How does DeepEval differ from Ragas?

| | DeepEval | Ragas |
|---|---|---|
| Style | pytest-native | Dataset-batch |
| Strength | Single-test assertions, CI-first ergonomics, red-teaming | Aggregate RAG-metric reporting |
| Metric set | Broader (G-Eval, hallucination, bias, toxicity, agents, conversational) | RAG-focused with growing agent support |
| Commercial layer | Confident AI (paid cloud) | Ragas App |

For most teams, I'd recommend DeepEval for behavioural/safety/agent tests and Ragas for aggregate RAG metrics. They overlap, but each has the cleaner API for its strength.
Core Metrics¶
| Metric | Category | What It Measures |
|---|---|---|
| G-Eval | Custom | LLM-as-judge against a user-defined evaluation criterion (most flexible) |
| Hallucination | RAG/Gen | Does the output contradict the supplied `context`? |
| Faithfulness | RAG | Are the output's claims supported by the `retrieval_context`? |
| Answer Relevancy | Gen | Does the answer address the question? |
| Contextual Precision / Recall / Relevancy | Retrieval | Three retrieval-quality dimensions |
| Bias | Safety | Demographic/political/ideological bias in the output |
| Toxicity | Safety | Harmful, offensive, or unsafe content |
| Summarization | Specialised | Alignment + inclusion against a source text |
| Tool Correctness | Agentic | Did the agent call the right tools? |
| Task Completion | Agentic | Did the agent achieve the goal? |
| Conversational Relevancy / Completeness / Knowledge Retention | Chat | Multi-turn conversation metrics |
| Red-Teaming Vulnerabilities | Safety | 50+ attack categories — prompt injection, PII leakage, jailbreaks, etc. |
Implementation¶
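Q: What's the minimal setup?
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric

# Placeholders: in a real test these come from your application.
llm_response = "Refunds are available within 30 days of purchase."
retrieved_chunks = ["Customers may request a refund within 30 days of purchase."]

def test_rag_answer():
    test_case = LLMTestCase(
        input="What's our refund policy?",
        actual_output=llm_response,
        context=retrieved_chunks,  # HallucinationMetric reads `context`
        retrieval_context=retrieved_chunks,
    )
    assert_test(test_case, [
        HallucinationMetric(threshold=0.3),  # lower = better
        AnswerRelevancyMetric(threshold=0.7),  # higher = better
    ])
```
Then run with `deepeval test run test_rag.py` (or plain `pytest`).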
Q: What's G-Eval?
The most powerful metric: define an evaluation criterion in plain English and DeepEval builds an LLM-judge chain-of-thought around it. Use when no built-in metric fits.
```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

professionalism = GEval(
    name="Professionalism",
    criteria="The answer is polite, neutral, and avoids slang.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)
```
Q: How do I configure the judge model?
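Default is OpenAI; configure via env var or the `model=` parameter on each metric. Local Ollama and LiteLLM-backed models can also serve as judges, but how they are selected varies by DeepEval version: some releases expect a model wrapper object or CLI configuration rather than a bare string like the one below, so check the docs for your pinned version.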
```python
HallucinationMetric(threshold=0.3, model="ollama/llama3.1:8b")
```
Red-Teaming (the killer feature)¶
Q: What does DeepEval's red-teaming module do?
It generates adversarial inputs across configurable vulnerability categories (prompt injection, PII, jailbreak, bias, toxicity, illegal activity, etc.) and runs them against your LLM application, then scores responses for compliance.
```python
from deepeval.red_teaming import RedTeamer
from deepeval.red_teaming.types import Vulnerability, AttackEnhancement

red_teamer = RedTeamer(
    target_purpose="Answer customer-service questions for a bank",
    target_system_prompt=system_prompt,
)
results = red_teamer.scan(
    target_model_callback=my_llm_callback,
    attacks_per_vulnerability=10,
    vulnerabilities=[Vulnerability.PROMPT_INJECTION, Vulnerability.PII_LEAKAGE],
    attack_enhancements={AttackEnhancement.JAILBREAK_LINEAR: 0.5},
)
```
This is the easiest off-the-shelf way to get a behavioural / safety test suite started.
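The `scan` call above takes `my_llm_callback` and `system_prompt` without defining them. As a minimal sketch, assuming the callback receives one attack prompt as a string and returns the application's reply as a string (verify the signature for your DeepEval version; some releases may expect an async callable), the adapter can be a thin wrapper around your application. `answer_customer` is a hypothetical stand-in for that application, not a DeepEval function.

```python
# Hypothetical system prompt for the application under test.
system_prompt = "You are a customer-service assistant for a bank. Never reveal account data."


def answer_customer(user_message: str) -> str:
    """Stand-in for the real application; replace with your LLM pipeline call."""
    return f"I can help with general banking questions. You asked: {user_message[:40]}"


def my_llm_callback(prompt: str) -> str:
    """Adapter handed to the red-teamer: one attack prompt in, one reply out.

    Assumed signature (str -> str); check it against your pinned DeepEval version.
    """
    return answer_customer(prompt)
```

Keeping the adapter this thin means the red-team scan exercises the same code path as production traffic.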
Agentic & Conversational¶
Q: How does DeepEval test agents?
Two main metrics: `ToolCorrectnessMetric` (did the agent call the expected tools with the expected arguments?) and `TaskCompletionMetric` (did it achieve the goal?). For multi-step traces, you assert on a trace object that captures the full tool-call sequence.
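To make the tool-correctness question concrete, here is a plain-Python sketch of what such a check compares. It is not DeepEval's scoring code: the `ToolCall` record, the `call` helper, and the `tool_correctness` function are hypothetical, and the score (fraction of expected calls found in the trace with matching name and arguments) is one simple choice among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one tool invocation in an agent trace."""
    name: str
    arguments: tuple = ()  # sorted (key, value) pairs


def call(name: str, **kwargs) -> ToolCall:
    """Build a ToolCall with arguments normalised so keyword order does not matter."""
    return ToolCall(name, tuple(sorted(kwargs.items())))


def tool_correctness(expected: list, actual: list) -> float:
    """Fraction of expected calls found in the actual trace (name and arguments)."""
    if not expected:
        return 1.0
    remaining = list(actual)
    hits = 0
    for exp in expected:
        if exp in remaining:
            remaining.remove(exp)  # each actual call can satisfy one expectation
            hits += 1
    return hits / len(expected)


expected = [call("lookup_account", customer_id="42"), call("issue_refund", amount=20)]
actual = [call("lookup_account", customer_id="42"), call("issue_refund", amount=25)]
score = tool_correctness(expected, actual)
```

Here `score` is 0.5: the agent picked the right tools but passed the wrong refund amount, which is the kind of regression a name-only comparison would miss.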
Q: How does it test multi-turn chat?
`ConversationalTestCase` carries a list of turns. Metrics include `ConversationRelevancyMetric`, `ConversationCompletenessMetric`, and `KnowledgeRetentionMetric` (does the bot remember what the user said three turns ago?).
Operational¶
Q: How do I integrate with pytest and CI?
DeepEval is pytest under the hood: `deepeval test run` is a pytest wrapper, so you can also just run `pytest`. Mark expensive evals with `@pytest.mark.slow`. Use `assert_test` to fail builds on threshold misses.
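`slow` is a custom marker, so pytest warns about it unless it is registered. A minimal sketch of a `conftest.py` hook that registers it, assuming the marker name `slow` from the answer above; the description text is illustrative.

```python
# conftest.py
def pytest_configure(config):
    """Register the custom `slow` marker so pytest does not warn about it."""
    config.addinivalue_line(
        "markers",
        "slow: expensive LLM-judge evals (deselect with -m 'not slow')",
    )
```

With the marker registered, `pytest -m "not slow"` skips the expensive evals, and `pytest -m slow` runs only them, for example in a nightly job.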
Q: How do I track results over time?
Two options: (1) log to Confident AI (paid cloud) for dashboards and dataset versioning, or (2) export results to JSONL and build your own trend dashboards. The free path works fine for most teams.
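For the do-it-yourself path, a minimal sketch using only the standard library: append one JSON line per metric result, then average scores per run and metric. The record fields (`run_id`, `metric`, `score`), the run names, and the file name are assumptions for illustration, not a DeepEval export format.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from statistics import mean


def append_result(path: Path, run_id: str, metric: str, score: float) -> None:
    """Append one eval result as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"run_id": run_id, "metric": metric, "score": score}) + "\n")


def trend(path: Path) -> dict:
    """Return the mean score for each (run_id, metric) pair in the log."""
    buckets = defaultdict(list)
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines in the log
                row = json.loads(line)
                buckets[(row["run_id"], row["metric"])].append(row["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up scores written to a temporary file for demonstration.
log = Path(tempfile.mkdtemp()) / "eval_results.jsonl"
append_result(log, "run-1", "answer_relevancy", 1.0)
append_result(log, "run-1", "answer_relevancy", 0.5)
append_result(log, "run-2", "answer_relevancy", 0.5)
summary = trend(log)
```

Because the log is append-only, each CI run can add its rows and a dashboard or notebook can recompute `trend` from the full history.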
Q: Synthetic dataset generation?
DeepEval has a `Synthesizer` that generates eval datasets from your documents as (question, expected answer, expected context) triples. Useful to bootstrap a golden set quickly. Always human-review samples; synthetic-only is risky.
```python
from deepeval.synthesizer import Synthesizer

synth = Synthesizer()
dataset = synth.generate_goldens_from_docs(document_paths=["policy.pdf"], max_goldens_per_context=2)
```
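One way to act on the human-review advice is to pull a fixed-size random sample of the generated goldens for manual checking. A minimal sketch, assuming the goldens are available as a plain list; the dictionaries below are stand-ins and `sample_for_review` is a hypothetical helper, not part of DeepEval.

```python
import random


def sample_for_review(goldens: list, k: int, seed: int = 0) -> list:
    """Return up to k randomly chosen goldens; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(goldens, min(k, len(goldens)))


# Hypothetical stand-ins for generated goldens.
goldens = [{"input": f"question {i}", "expected_output": f"answer {i}"} for i in range(20)]
to_review = sample_for_review(goldens, k=5)
```

Fixing the seed lets two reviewers look at the same subset; change it between review rounds to cover different items.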
Gotchas¶
- Threshold direction — hallucination/bias/toxicity: lower is better. Relevancy/faithfulness: higher is better. Easy to get wrong; check the metric docstring.
- LLM-judge cost — every metric is one or more LLM calls per test case. Use cheap judges in CI, stronger judges for release gates.
- Determinism — same caveats as Ragas. Run multiple samples for high-stakes assertions.
- Version pinning — metric internals shift across versions; pin in CI.
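Two of the gotchas above, threshold direction and judge non-determinism, can be illustrated in plain Python. This is a sketch of the logic, not DeepEval's implementation: `passes` and `passes_majority` are hypothetical helpers, and the scores are made-up values standing in for repeated judge runs on the same test case.

```python
def passes(score: float, threshold: float, lower_is_better: bool) -> bool:
    """Apply a threshold in the right direction for the metric."""
    return score <= threshold if lower_is_better else score >= threshold


def passes_majority(scores: list, threshold: float, lower_is_better: bool) -> bool:
    """Pass only if more than half of the repeated judge runs pass."""
    votes = sum(passes(s, threshold, lower_is_better) for s in scores)
    return votes * 2 > len(scores)


# Made-up scores from three judge runs on the same test case.
hallucination_runs = [0.2, 0.4, 0.1]  # lower is better, threshold is a maximum
relevancy_runs = [0.8, 0.6, 0.65]     # higher is better, threshold is a minimum

hallucination_ok = passes_majority(hallucination_runs, threshold=0.3, lower_is_better=True)
relevancy_ok = passes_majority(relevancy_runs, threshold=0.7, lower_is_better=False)
```

Here the hallucination check passes (two of three runs are at or below 0.3) while the relevancy check fails (only one of three runs reaches 0.7). A single run could have landed on either side for both.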
Useful Links¶
- Docs: https://deepeval.com/docs
- GitHub: https://github.com/confident-ai/deepeval
- Confident AI (commercial dashboard): https://confident-ai.com
Interview Sound-Bites¶
- "DeepEval's strength is the pytest ergonomics — every eval is a normal test, normal CI, normal reporting. That makes adoption easy for an existing QA team."
- "The red-teaming module is the fastest way to bootstrap a behavioural test suite — 50+ vulnerability categories out of the box, then you tune to your domain."
- "For agents, I use `ToolCorrectnessMetric` for trace-level assertions and `TaskCompletionMetric` for end-to-end; together they catch both 'right answer, wrong path' and 'wrong answer' regressions."