Ragas: FAQ & Quick Reference
Ragas (Retrieval-Augmented Generation Assessment) is an open-source Python framework for evaluating RAG pipelines, maintained by Exploding Gradients. Install it with `pip install ragas`.
Fundamentals
Q: What is Ragas? A Python evaluation framework for RAG systems. It provides reference-free and reference-based metrics for retrieval quality and generation quality, built on LLM-as-judge scoring plus deterministic checks. It integrates with LangChain, LlamaIndex, and raw Hugging Face datasets.
Q: When should I use Ragas vs writing my own evaluator? Use Ragas when its built-in metrics map cleanly to what you need (faithfulness, answer relevance, context precision/recall). Roll your own when you need domain-specific metrics, citation-presence checks, or tighter control over the judge prompt. In practice, most teams use both — Ragas for the standard set, custom evaluators for the rest.
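One of the custom checks mentioned above, citation presence, might look like the sketch below. It assumes answers cite retrieved chunks with 1-based numeric markers such as `[1]`; that format, the `citation_check` name, and the shape of the report are illustrative choices, not anything Ragas defines.

```python
import re

# Assumed citation format: numeric markers such as [1] or [2] that index
# into the list of retrieved chunks (1-based). Adjust the pattern to your style.
CITATION = re.compile(r"\[(\d+)\]")


def citation_check(answer: str, contexts: list[str]) -> dict:
    """Report whether the answer cites anything and whether every citation resolves."""
    cited = [int(n) for n in CITATION.findall(answer)]
    # A marker is dangling when it points outside the retrieved chunk list.
    dangling = sorted({n for n in cited if not 1 <= n <= len(contexts)})
    return {
        "has_citation": bool(cited),
        "dangling": dangling,
        "passed": bool(cited) and not dangling,
    }


report = citation_check(
    "Paris is the capital of France [1]. It hosts the Louvre [3].",
    ["Paris is the capital of France.", "The Louvre is in Paris."],
)
print(report)
```

A check like this is deterministic and free to run, so it can sit alongside the LLM-judged metrics without adding cost.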
Q: Does Ragas need ground-truth answers? Some metrics do, some don't. Reference-free: faithfulness, answer relevance, context relevance. Reference-based (need a ground-truth answer or context): answer correctness, context precision, context recall. You can run reference-free metrics on production traffic; reference-based ones need a labelled set.
Core Metrics
| Metric | Type | What It Measures | Needs Ground Truth? |
|---|---|---|---|
| Faithfulness | Generation | Are the claims in the answer supported by the retrieved context? | No |
| Answer Relevance | Generation | Does the answer actually address the question? | No |
| Context Relevance | Retrieval | Is the retrieved context relevant to the question? | No |
| Context Precision | Retrieval | Are the most relevant chunks ranked highest? | Yes (reference answer by default; reference contexts in the non-LLM variant) |
| Context Recall | Retrieval | Did retrieval pull all the chunks needed to answer? | Yes (reference answer by default; reference contexts in the non-LLM variant) |
| Answer Correctness | End-to-end | How close is the answer to the labelled correct answer (semantic + factual)? | Yes |
| Answer Similarity | End-to-end | Semantic similarity to the ground-truth answer | Yes |
| Aspect Critique | Custom | Score the answer against a user-defined criterion (harmfulness, conciseness, etc.) | No |
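To make the table concrete, here is a rough sketch of the arithmetic behind two of the scores, assuming the judge's verdicts are already available as booleans. The function names are illustrative and not part of the Ragas API; the library wraps this arithmetic in prompting, parsing, and its own edge-case handling, and returning 0.0 for empty input is purely this sketch's choice.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer claims the judge marked as supported by the context."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def context_precision_score(relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Three of four claims supported by the context.
print(faithfulness_score([True, True, False, True]))

# Same two relevant chunks, ranked early versus late.
print(context_precision_score([True, False, True]))
print(context_precision_score([False, True, True]))
```

The last two calls use the same relevant chunks in different positions; the score is higher when the relevant chunks sit nearer the top, which is what "ranked highest" means in the table.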
Implementation
Q: What's the minimal setup?
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

data = {
    "question": [...],
    "answer": [...],
    "contexts": [[...], [...]],  # list of retrieved chunks per question
    "ground_truth": [...],  # only needed for reference-based metrics
}
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)
```
Q: Which LLM does Ragas use as judge?
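Configurable. Default is OpenAI (gpt-4o-mini or similar in recent versions). You can swap in any LangChain-compatible model: Anthropic, Azure OpenAI, local via Ollama, Bedrock, Vertex. Pass a wrapped model through the `llm` argument of `evaluate()` or assign it to individual metrics. `RunConfig` governs timeouts, retries, and concurrency rather than the choice of model.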
Q: How do I use a local model (Ollama) as the judge?
```python
from langchain_community.chat_models import ChatOllama
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# Wrap the LangChain chat model so Ragas can use it as the judge.
judge = LangchainLLMWrapper(ChatOllama(model="llama3.1:8b"))
faithfulness.llm = judge
```
Caveat: weaker judges produce noisier scores. Validate against a stronger model on a sample first.
Q: How do I integrate Ragas with pytest?
Run `evaluate()` inside the test and assert thresholds on the returned scores. Treat each metric as a budget: fail the test if it drops below a chosen minimum. Mark expensive runs with `@pytest.mark.slow` so they run only in CI, not during local development.
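A minimal sketch of the budget idea, assuming the evaluation result has already been reduced to a dict of mean scores per metric (for example by averaging the metric columns of `result.to_pandas()`). The `BUDGETS` values and the `budget_violations` helper are hypothetical; the scores are hard-coded so the sketch runs without a judge model.

```python
import math

# Hypothetical budgets: the minimum acceptable mean score per metric.
BUDGETS = {"faithfulness": 0.85, "answer_relevancy": 0.80}


def budget_violations(scores: dict[str, float], budgets: dict[str, float]) -> list[str]:
    """Return a readable message for every metric that is missing, NaN, or below budget."""
    problems = []
    for metric, minimum in budgets.items():
        score = scores.get(metric)
        if score is None or math.isnan(score):
            problems.append(f"{metric}: no usable score")
        elif score < minimum:
            problems.append(f"{metric}: {score:.3f} < {minimum:.3f}")
    return problems


def test_rag_quality_budgets():
    # In a real test this dict would be derived from evaluate(); it is
    # hard-coded here so the sketch runs offline.
    scores = {"faithfulness": 0.91, "answer_relevancy": 0.83}
    assert budget_violations(scores, BUDGETS) == []
```

Returning a list of messages rather than a bare boolean means a failing run reports every metric that slipped, not just the first one. In a real suite the test function would carry the `@pytest.mark.slow` marker.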
Operational
Q: How expensive is a Ragas run?
Faithfulness and context precision are token-heavy: faithfulness has the judge extract and verify every claim in an answer, and context precision judges each retrieved chunk. A 100-question eval with gpt-4o-mini typically costs single-digit dollars; with gpt-4o or Claude Opus, an order of magnitude more. Budget accordingly, and consider a cheap judge model for CI plus a stronger judge for release gates.
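For budgeting, a back-of-envelope estimator can help before launching a run. The sketch below is plain arithmetic, and every number in the example call is a placeholder rather than a real price or a measured token count; substitute your provider's current rates and token counts sampled from your own prompts.

```python
def estimate_judge_cost(
    n_questions: int,
    calls_per_question: float,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Rough USD cost of one eval run for a single metric."""
    calls = n_questions * calls_per_question
    input_cost = calls * input_tokens_per_call * usd_per_million_input / 1_000_000
    output_cost = calls * output_tokens_per_call * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# All numbers below are placeholders, not real prices or measured token counts.
cost = estimate_judge_cost(
    n_questions=100,
    calls_per_question=10,
    input_tokens_per_call=1_500,
    output_tokens_per_call=200,
    usd_per_million_input=1.0,
    usd_per_million_output=4.0,
)
print(f"estimated cost: ${cost:.2f}")
```

Running the estimate once per metric and per candidate judge makes the CI-versus-release-gate trade-off easier to compare.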
Q: How do I make Ragas deterministic? You can't fully — LLM-as-judge introduces variance. Mitigations: pin temperature=0, use a fixed seed where the provider supports it, run multiple samples and average, and assert on score distributions rather than point values. For CI gates, use thresholds with safety margins.
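The averaging and margin ideas can be sketched as below. Here the safety margin is read as tolerated noise below a baseline score, which is one interpretation and an assumption of this sketch; the `passes_gate` helper and the sample scores are hypothetical.

```python
from statistics import mean, stdev


def passes_gate(samples: list[float], baseline: float, margin: float) -> bool:
    """Pass unless the mean of repeated runs falls more than `margin` below the baseline."""
    if not samples:
        raise ValueError("need at least one sample")
    return mean(samples) >= baseline - margin


# Hypothetical faithfulness scores from five repeated runs on the same dataset.
runs = [0.88, 0.91, 0.86, 0.90, 0.89]
spread = stdev(runs)  # run-to-run noise; one way to choose the margin
print(f"mean={mean(runs):.3f} stdev={spread:.3f}")
print(passes_gate(runs, baseline=0.90, margin=2 * spread))
```

Deriving the margin from observed run-to-run spread ties the gate to the judge's actual noise rather than to a guessed constant.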
Q: How does Ragas handle multi-hop questions? Context recall checks each claim in the reference answer against the retrieved context, and context precision judges each retrieved chunk by rank, so multi-hop is fine as long as your labelled set covers every hop. Faithfulness still works because it decomposes the answer into claims and checks each against the full retrieved context.
Q: Can Ragas evaluate agentic / tool-using systems? Partially. Ragas v0.2+ added agent-evaluation metrics (tool-call accuracy, agent goal accuracy, topic adherence). But for trace-level testing — which tools were called in what order — you still typically build custom assertions on top.
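A trace-level assertion of the kind described above can be quite small. This sketch assumes the trace has already been flattened to a list of tool names in call order; the tool names and the `called_in_order` helper are hypothetical and not part of Ragas.

```python
def called_in_order(trace: list[str], expected: list[str]) -> bool:
    """True if the expected tools appear in the trace in this order (gaps allowed)."""
    remaining = iter(trace)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected)


# Hypothetical trace: tool names in the order the agent invoked them.
trace = ["search_docs", "rerank", "fetch_page", "summarize"]
assert called_in_order(trace, ["search_docs", "summarize"])
assert not called_in_order(trace, ["summarize", "search_docs"])
```

Allowing gaps keeps the assertion stable when the agent adds harmless intermediate calls; a stricter check would compare the lists for equality.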
Gotchas
- Score drift between Ragas versions: metric definitions and judge prompts change between releases. Pin the version (`ragas==0.2.x`) for reproducible eval reports.
- LLM-as-judge bias: same model as judge and generator inflates scores. Use a different (ideally stronger) model as judge.
- Empty retrievals: faithfulness with no context can give odd results. Filter or handle separately.
- Cost surprises: context-precision is one LLM call per (chunk × question). A 500-question, 10-chunk eval is 5,000 judge calls.
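For the empty-retrievals item, one way to "filter or handle separately" is to split the dataset before scoring. The sketch below assumes rows are dicts using the same keys as the minimal setup (`question`, `answer`, `contexts`); the `split_by_retrieval` helper is hypothetical.

```python
def split_by_retrieval(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate rows with usable context from rows whose retrieval came back empty."""
    with_context, empty = [], []
    for row in rows:
        # Treat missing, empty, and whitespace-only chunks as no context.
        chunks = [c for c in row.get("contexts", []) if c and c.strip()]
        (with_context if chunks else empty).append(row)
    return with_context, empty


rows = [
    {"question": "q1", "answer": "a1", "contexts": ["some retrieved text"]},
    {"question": "q2", "answer": "a2", "contexts": []},
    {"question": "q3", "answer": "a3", "contexts": ["   "]},
]
scorable, empty = split_by_retrieval(rows)
print(len(scorable), len(empty))
```

Only `scorable` would go to the faithfulness metric; the size of `empty` is worth reporting on its own as a retrieval-failure rate.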
Useful Links
- Docs: https://docs.ragas.io/
- GitHub: https://github.com/explodinggradients/ragas
- Commercial offering: Ragas App (hosted UI for managing eval runs)
Interview Sound-Bites
- "Ragas covers the standard RAG metric set well — I use it for faithfulness, answer relevance, and context precision/recall, then layer custom evaluators on top for things like citation validity that Ragas doesn't ship."
- "I always run a different judge model than the generator — same-model judging inflates scores systematically."
- "Pin the version. Metric definitions evolve and your historical numbers stop comparing cleanly otherwise."