Evaluation in RAG
1. Why RAG Evaluation Is Hard
RAG has two components that can fail independently: the retriever and the generator. An answer can sound correct yet be unfaithful to the retrieved context, and a relevant chunk can be retrieved yet used incorrectly.
Three failure modes:

- Retrieval failure: the retriever returns wrong or irrelevant chunks.
- Generation failure: the generator ignores or misuses a correct chunk.
- Coverage failure: the correct chunk exists in the corpus but is not retrieved.

Evaluation must cover all three failure modes, both per component and end-to-end.
2. Retrieval Metrics
These metrics evaluate how good the retrieved chunks are, independent of generation.
| Metric | What It Measures |
|---|---|
| Precision@K | Fraction of top-K retrieved chunks that are relevant |
| Recall@K | Fraction of all relevant chunks that appear in top-K |
| MRR (Mean Reciprocal Rank) | Average over queries of 1/rank of the first relevant chunk |
| NDCG (Normalized Discounted Cumulative Gain) | Ranking quality: graded relevance discounted by position, normalized against the ideal ordering |
| Context Relevance (RAGAS) | LLM-judged: are the retrieved chunks relevant to the question? |
Practical minimum: track Recall@K (did we retrieve the chunk that contains the answer?) and Context Relevance.
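The rank-based metrics in the table above can be computed directly once you have labelled relevant chunks per query. The following is a minimal sketch, not a reference implementation: it assumes chunk IDs are unique strings, that `k` is at least 1, and that graded relevance for NDCG is supplied as a dict. All function names are illustrative.

```python
import math
from typing import Mapping, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: Sequence[str], grades: Mapping[str, float], k: int) -> float:
    """DCG of the top-k divided by the DCG of the ideal ordering."""
    dcg = sum(
        grades.get(chunk_id, 0.0) / math.log2(i + 2)
        for i, chunk_id in enumerate(retrieved[:k])
    )
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
print(recall_at_k(retrieved, relevant, k=3))
print(reciprocal_rank(retrieved, relevant))
```

Averaging `recall_at_k` over the whole test set gives the Recall@K number suggested as the practical minimum.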
3. Generation Metrics
These metrics evaluate how good the generated answer is given the retrieved context.
| Metric | What It Measures |
|---|---|
| Faithfulness (RAGAS) | Is every claim in the answer supported by the retrieved context? |
| Answer Relevance (RAGAS) | Does the answer actually address the question asked? |
| Answer Correctness | Is the answer factually correct against a ground truth? |
| Completeness | Does the answer cover all aspects of the question? |
| ROUGE / BLEU | Lexical overlap with a reference answer (weak signal for open-ended QA) |
Key insight: Faithfulness and Answer Relevance are the two most actionable RAGAS metrics. A low faithfulness score points to generation failure; a low answer relevance score often points to retrieval failure.
4. RAGAS Framework
RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely used frameworks for RAG evaluation. Most of its metrics are LLM-judged and need no human-labelled answer; only the reference-based ones (such as Context Recall) require a ground truth.
Core RAGAS metrics:
| Metric | Formula Intuition |
|---|---|
| Faithfulness | # claims in answer supported by context / # total claims in answer |
| Answer Relevance | Generate questions from the answer; mean cosine similarity of their embeddings to the original question |
| Context Precision | Rank-aware proportion of retrieved chunks that are relevant (relevant chunks should appear near the top) |
| Context Recall | Proportion of ground-truth statements covered by retrieved chunks |
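To make the first two formula intuitions concrete, here is a small sketch of the arithmetic only. It is not the RAGAS implementation: in RAGAS the claim verdicts and the question embeddings come from an LLM and an embedding model, whereas here they are assumed to be precomputed and passed in. The function names are illustrative.

```python
import math
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Supported claims divided by total claims in the answer."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance_score(
    question_vec: Sequence[float],
    generated_question_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean similarity between the original question and questions
    reverse-engineered from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


# Four claims in the answer, three of them supported by the context.
print(faithfulness_score([True, True, False, True]))  # 0.75
```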
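```python
# Assumes the ragas 0.1-style API and column names, plus an LLM API key
# (OpenAI by default) in the environment for the LLM-judged metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One sample per list index; `contexts` holds the retrieved chunks for each question.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France is a country in Europe. Its capital is Paris."]],
    "ground_truth": ["Paris"],
}

# context_precision and context_recall rely on the `ground_truth` column.
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```

5. Building a Test Set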
A good evaluation test set is the foundation of trustworthy metrics.
Construction approaches:
| Approach | Description | Cost |
|---|---|---|
| Manual curation | Human experts write Q&A pairs | High quality, expensive |
| Synthetic generation | LLM generates (question, answer, context) triples from your corpus | Fast, scalable |
| User query mining | Sample real production queries and label answers | Reflects actual use |
| Adversarial | Craft hard questions: multi-hop, negations, out-of-scope | Tests edge cases |
RAGAS TestsetGenerator:
```python
# Assumes the ragas 0.1-style testset API and an OpenAI API key in the environment.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# `documents` is a list of LangChain Document objects loaded from your corpus beforehand.
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
```

Rule of thumb: 100–200 diverse questions are usually enough to track meaningful regressions for most production pipelines.
6. Offline vs. Online Evaluation
| Mode | When | What |
|---|---|---|
| Offline (batch) | Pre-deployment / CI | Run test set through pipeline, compute metrics, compare to baseline |
| Online (production) | Post-deployment | Sample live traffic, score with LLM judge, track dashboards |
Offline evaluation checklist:

- Pin all versions (embedding model, retriever config, LLM, prompt template)
- Log raw inputs/outputs alongside metrics for debugging
- Set a minimum acceptable threshold per metric (e.g., Faithfulness ≥ 0.85)
- Fail CI if any metric regresses beyond a tolerance (e.g., ≥ 5% relative drop)
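The last two checklist items can be combined into a single gate. The sketch below is one possible shape, under assumptions: metrics are plain floats keyed by name, higher is better, and the function name and the example numbers are made up for illustration.

```python
from typing import Dict, List


def find_violations(
    current: Dict[str, float],
    baseline: Dict[str, float],
    floors: Dict[str, float],
    max_relative_drop: float = 0.05,
) -> List[str]:
    """Return a human-readable message for every failed check."""
    violations = []
    for name, value in current.items():
        # Absolute floor, e.g. Faithfulness >= 0.85.
        floor = floors.get(name)
        if floor is not None and value < floor:
            violations.append(f"{name}={value:.3f} is below the floor {floor:.3f}")
        # Relative regression against the stored baseline.
        base = baseline.get(name)
        if base:  # skip metrics with no (or zero) baseline
            drop = (base - value) / base
            if drop >= max_relative_drop:
                violations.append(f"{name} dropped {drop:.1%} vs. baseline {base:.3f}")
    return violations


current = {"faithfulness": 0.86, "answer_relevancy": 0.88}
baseline = {"faithfulness": 0.91, "answer_relevancy": 0.87}
floors = {"faithfulness": 0.85}

for message in find_violations(current, baseline, floors):
    print("FAIL:", message)
```

In a CI job, a non-empty result would translate into a non-zero exit code. In this example faithfulness clears its floor but still fails on the relative drop (about 5.5%).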
Online sampling:

- Don't score 100% of traffic; sample 5–10% to control LLM judge cost
- Use stratified sampling to cover rare query types
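One way to implement both sampling rules is a deterministic, hash-based sampler with a separate rate per query type. This is a sketch under assumptions: each request carries a stable `request_id` and a `query_type` label from some upstream classifier, and the type names and rates shown are hypothetical.

```python
import hashlib
from typing import Dict

# Hypothetical per-stratum rates: rare query types are sampled more heavily.
SAMPLE_RATES: Dict[str, float] = {
    "faq": 0.05,
    "multi_hop": 0.50,
    "out_of_scope": 1.00,
}


def should_score(
    request_id: str,
    query_type: str,
    rates: Dict[str, float] = SAMPLE_RATES,
    default_rate: float = 0.05,
) -> bool:
    """Decide whether this request is sent to the LLM judge.

    Hashing the request ID makes the decision reproducible: the same
    request is always in or out of the sample, regardless of which
    worker handles it.
    """
    rate = rates.get(query_type, default_rate)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


sampled = sum(should_score(f"req-{i}", "faq") for i in range(10_000))
print(f"scored {sampled} of 10000 faq requests")
```

Keeping the rates in one table makes the judge budget explicit: the expected cost is the sum of traffic volume times rate over all query types.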
7. LLM-as-Judge
Using a strong LLM (e.g., GPT-4o) to judge outputs is now standard because it scales far better than human annotation and, on many tasks, correlates reasonably well with human judgment.
Prompt pattern:
```text
You are an expert evaluator. Given the question, retrieved context, and answer below,
rate the faithfulness of the answer on a scale of 1–5.
Question: {question}
Context: {context}
Answer: {answer}
Faithfulness score (1=completely unfaithful, 5=fully supported):
```
Pitfalls:

- Length and confidence bias: the judge LLM can favor longer or more confident-sounding answers.
- Run-to-run variance: the same judge can return different scores on re-runs, so average over 3 runs.
- Self-preference bias: use a different model as judge than as generator.
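The prompt pattern and the averaging advice can be wired together as follows. This is a sketch, not a definitive implementation: `call_judge` is a hypothetical callable that sends a prompt to whatever judge model you use and returns its text reply, and the score parser simply takes the first standalone digit from 1 to 5.

```python
import re
import statistics
from typing import Callable, Optional

JUDGE_PROMPT = (
    "You are an expert evaluator. Given the question, retrieved context, and answer below,\n"
    "rate the faithfulness of the answer on a scale of 1-5.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    "Faithfulness score (1=completely unfaithful, 5=fully supported):"
)


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_faithfulness(
    question: str,
    context: str,
    answer: str,
    call_judge: Callable[[str], str],  # hypothetical: prompt in, reply text out
    runs: int = 3,
) -> float:
    """Query the judge several times and average the parsed scores."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    scores = []
    for _ in range(runs):
        score = parse_score(call_judge(prompt))
        if score is not None:  # drop replies that contain no valid score
            scores.append(score)
    if not scores:
        raise ValueError("judge returned no parseable score")
    return statistics.mean(scores)


# Stand-in judge for a dry run; replace with a real model call.
canned_replies = iter(["4", "5", "Score: 4"])
score = judge_faithfulness(
    "What is the capital of France?",
    "France is a country in Europe. Its capital is Paris.",
    "Paris is the capital of France.",
    call_judge=lambda prompt: next(canned_replies),
)
print(round(score, 2))
```

Injecting `call_judge` keeps the scoring logic testable without network access and makes it easy to swap in a judge model that differs from the generator.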
8. Regression Tracking
Treat evaluation like a software test suite: every pipeline change should be validated against the baseline before shipping.
Workflow:
1. Commit pipeline change (new embedding model, chunk size, prompt, etc.)
2. Run eval suite on pinned test set
3. Compare metric deltas to baseline
4. If any metric drops > threshold → block merge
5. If all metrics pass → update baseline, merge
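Steps 2 to 5 of the workflow can be sketched as a small script around a JSON baseline file. The assumptions here: the baseline holds a `version` string plus numeric metrics, higher is better, and the 0.02 absolute-drop threshold is a placeholder, not a recommendation. Function names are illustrative.

```python
import json
from pathlib import Path
from typing import Dict, Tuple


def compare_to_baseline(
    current: Dict[str, float],
    baseline: Dict[str, object],
    max_drop: float = 0.02,
) -> Tuple[bool, Dict[str, float]]:
    """Return (passed, deltas) for every numeric metric in the baseline."""
    deltas = {}
    passed = True
    for name, base in baseline.items():
        if not isinstance(base, (int, float)):
            continue  # skip non-metric fields such as "version"
        if name not in current:
            passed = False  # a metric that disappeared counts as a failure
            continue
        deltas[name] = round(current[name] - base, 6)
        if deltas[name] < -max_drop:
            passed = False
    return passed, deltas


def gate_and_update(current: Dict[str, float], path: Path, version: str) -> bool:
    """Block on regression; otherwise overwrite the baseline file."""
    baseline = json.loads(path.read_text())
    passed, deltas = compare_to_baseline(current, baseline)
    print("deltas vs. baseline:", deltas)
    if passed:
        path.write_text(json.dumps({"version": version, **current}, indent=2))
    return passed


baseline = {"version": "2024-11-01", "faithfulness": 0.91, "answer_relevancy": 0.87}
print(compare_to_baseline({"faithfulness": 0.92, "answer_relevancy": 0.80}, baseline))
```

Only a passing run rewrites the baseline, so a blocked change can never lower the bar for the next one.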
Tools:

- MLflow / W&B: log eval runs, compare experiments, track metric history
- RAGAS in CI: run the RAGAS `evaluate` call from a script in GitHub Actions
- LangSmith / Phoenix: built-in dataset + eval runner with regression dashboards
Baseline storage:
```json
{
  "version": "2024-11-01",
  "faithfulness": 0.91,
  "answer_relevancy": 0.87,
  "context_recall": 0.79
}
```

9. End-to-End vs. Component Evaluation
| Level | What to Measure | When to Use |
|---|---|---|
| Component: Retriever | Recall@K, Precision@K | Optimising chunk size, embedding model, top-K |
| Component: Reranker | NDCG, precision delta vs. no reranker | Evaluating reranker effectiveness |
| Component: Generator | Faithfulness, Answer Relevance | Optimising prompts, models |
| End-to-end | Answer Correctness, user satisfaction | Validating overall pipeline |
Tip: Start with end-to-end. If overall quality is poor, drill into components to isolate the bottleneck.
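The drill-down order in the tip can be written as a simple triage rule. This is only a sketch: the metric keys, the target values, and the order of checks are assumptions, and it leaves out the reranker for brevity.

```python
from typing import Dict


def locate_bottleneck(metrics: Dict[str, float], targets: Dict[str, float]) -> str:
    """Start end-to-end, then drill into components if quality is poor."""

    def below(name: str) -> bool:
        return metrics[name] < targets[name]

    # 1. End-to-end check: if answers are correct often enough, stop here.
    if not below("answer_correctness"):
        return "ok"
    # 2. Retriever: the generator cannot use chunks it never received.
    if below("recall_at_k"):
        return "retriever"
    # 3. Generator: the right chunks arrived but were ignored or misused.
    if below("faithfulness") or below("answer_relevance"):
        return "generator"
    return "unclear"


# Hypothetical targets and one evaluation run.
targets = {
    "answer_correctness": 0.80,
    "recall_at_k": 0.85,
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
}
run = {
    "answer_correctness": 0.62,
    "recall_at_k": 0.91,
    "faithfulness": 0.70,
    "answer_relevance": 0.88,
}
print(locate_bottleneck(run, targets))  # generator
```

An "unclear" result means every component metric passes while end-to-end quality does not, which usually calls for inspecting raw examples by hand.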
Summary
| Metric | Scope | Requires ground truth? |
|---|---|---|
| Context Relevance | Retriever | No (LLM judge) |
| Faithfulness | Generator | No (LLM judge) |
| Answer Relevance | Generator | No (LLM judge) |
| Context Recall | Retriever | Yes |
| Answer Correctness | End-to-end | Yes |
The no-reference metrics (Faithfulness, Answer Relevance, Context Relevance) are your day-to-day production monitors. The reference-based metrics are for periodic deep-dives with curated test sets.