Evaluation in RAG

Evaluation measures how well your RAG pipeline retrieves relevant context and generates faithful, accurate answers. It covers metric choice, test set construction, and regression tracking.
Author

Benedict Thekkel

1. Why RAG Evaluation Is Hard

RAG has two components that can each fail independently — the retriever and the generator. A correct-sounding answer may be unfaithful to the retrieved context; a relevant chunk may still not be used correctly.

Three failure modes:

- Retriever returns wrong or irrelevant chunks (retrieval failure)
- Generator ignores or misuses the correct chunk (generation failure)
- Correct chunk exists in the corpus but isn’t retrieved (coverage failure)

Evaluation must cover all three failure modes, both per component and end-to-end.


2. Retrieval Metrics

These metrics evaluate how good the retrieved chunks are, independent of generation.

| Metric | What It Measures |
| --- | --- |
| Precision@K | Fraction of top-K retrieved chunks that are relevant |
| Recall@K | Fraction of all relevant chunks that appear in top-K |
| MRR (Mean Reciprocal Rank) | How high the first relevant chunk is ranked |
| NDCG (Normalized Discounted Cumulative Gain) | Ranks weighted by position and relevance grade |
| Context Relevance (RAGAS) | LLM-judged: are the retrieved chunks relevant to the question? |

Practical minimum: track Recall@K (did we retrieve the chunk that contains the answer?) and Context Relevance.
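The rank-based metrics in the table can be computed without any framework. The following is a minimal sketch in plain Python, assuming each chunk has a string ID and that you have labelled which IDs are relevant for a query. The function names, chunk IDs, and relevance grades are illustrative, not taken from any library.

```python
import math
from typing import Mapping, Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], grades: Mapping[str, float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of an ideal ordering."""
    dcg = sum(
        grades.get(chunk_id, 0.0) / math.log2(pos + 2)
        for pos, chunk_id in enumerate(retrieved[:k])
    )
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(grade / math.log2(pos + 2) for pos, grade in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: hypothetical chunk IDs and relevance grades.
retrieved = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2", "c9"}
grades = {"c1": 2.0, "c2": 1.0, "c9": 1.0}

print(precision_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=3))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, grades, k=3))
```

MRR is the mean of `reciprocal_rank` over all queries in the test set; the other three are likewise averaged per query.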


3. Generation Metrics

These metrics evaluate how good the generated answer is given the retrieved context.

| Metric | What It Measures |
| --- | --- |
| Faithfulness (RAGAS) | Is every claim in the answer supported by the retrieved context? |
| Answer Relevance (RAGAS) | Does the answer actually address the question asked? |
| Answer Correctness | Is the answer factually correct against a ground truth? |
| Completeness | Does the answer cover all aspects of the question? |
| ROUGE / BLEU | Lexical overlap with a reference answer (weak signal for open-ended QA) |

Key insight: Faithfulness and Answer Relevance are the two most actionable RAGAS metrics. A low faithfulness score points to generation failure; a low answer relevance score often points to retrieval failure.


4. RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely used frameworks for RAG evaluation. Most of its metrics are LLM-judged and do not require human-labelled answers for every question; the exceptions, such as Context Recall, need a ground-truth reference.

Core RAGAS metrics:

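| Metric | Formula / Intuition |
| --- | --- |
| Faithfulness | (# claims in answer supported by context) / (# total claims in answer) |
| Answer Relevance | Generate questions back from the answer; cosine similarity to the original question |
| Context Precision | Rank-aware share of retrieved chunks that are relevant (relevant chunks ranked higher score better) |
| Context Recall | Proportion of ground-truth statements covered by retrieved chunks |

# Written against the ragas 0.1.x API. The default judge is an OpenAI model,
# so OPENAI_API_KEY must be set. Later ragas releases may use different column names.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One entry per question; "contexts" holds the list of retrieved chunks for that question.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France is a country in Europe. Its capital is Paris."]],
    "ground_truth": ["Paris"],
}

# context_precision and context_recall rely on the ground_truth column.
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # aggregate score per metric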
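
To make the Faithfulness and Answer Relevance formulas concrete, here is a rough sketch of the arithmetic only. It is not the RAGAS implementation: the per-claim verdicts and the vectors are hard-coded stand-ins for what an LLM judge and an embedding model would produce, and the function names are hypothetical.

```python
import math
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Supported claims divided by total claims in the answer."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevance_score(
    question_vec: Sequence[float],
    regenerated_question_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean similarity between the original question and questions
    generated back from the answer."""
    if not regenerated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, v) for v in regenerated_question_vecs]
    return sum(sims) / len(sims)


# Assumed judge verdicts: 3 of 4 claims are supported by the context.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))

# Toy 2-D "embeddings": one regenerated question matches, one is unrelated.
question_vec = [1.0, 0.0]
regenerated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance_score(question_vec, regenerated))
```

In a real pipeline the expensive part is producing the verdicts and the regenerated questions; the scoring itself is this simple.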

5. Building a Test Set

A good evaluation test set is the foundation of trustworthy metrics.

Construction approaches:

| Approach | Description | Trade-off |
| --- | --- | --- |
| Manual curation | Human experts write Q&A pairs | High quality, expensive |
| Synthetic generation | LLM generates (question, answer, context) triples from your corpus | Fast, scalable |
| User query mining | Sample real production queries and label answers | Reflects actual use |
| Adversarial | Craft hard questions: multi-hop, negations, out-of-scope | Tests edge cases |

RAGAS TestsetGenerator:

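# Written against the ragas 0.1.x testset API; with_openai() needs OPENAI_API_KEY.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# `documents` is assumed to be a list of LangChain Document objects loaded from your corpus.
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,  # number of questions to generate
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},  # share of each question type
)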

Rule of thumb: 100–200 diverse questions are usually enough to track meaningful regressions for most production pipelines.
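
Whatever mix of approaches you use, it helps to store every test case in one schema and check that the set is actually diverse. The sketch below is one possible layout, not a RAGAS format: the `EvalCase` fields, the category names, and the JSONL serialisation are all assumptions made for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class EvalCase:
    question: str
    ground_truth: str
    source: str    # how it was built, e.g. "manual", "synthetic", "production"
    category: str  # question type, e.g. "simple", "multi_hop", "out_of_scope"


def category_shares(cases: List[EvalCase]) -> Dict[str, float]:
    """Share of the test set that falls into each question category."""
    counts = Counter(case.category for case in cases)
    return {category: n / len(cases) for category, n in counts.items()}


def to_jsonl(cases: List[EvalCase]) -> str:
    """Serialise the test set as one JSON object per line for version control."""
    return "\n".join(json.dumps(asdict(case)) for case in cases)


cases = [
    EvalCase("What is the capital of France?", "Paris", "manual", "simple"),
    EvalCase("Which country is Paris the capital of?", "France", "synthetic", "simple"),
    EvalCase("Is the capital of France in Europe?", "Yes", "manual", "multi_hop"),
    EvalCase("Who won the 2050 election?", "Not answerable", "manual", "out_of_scope"),
]

print(category_shares(cases))
print(to_jsonl(cases))
```

Checking the shares before each run makes it obvious when a test set has drifted toward easy, single-hop questions.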


6. Offline vs. Online Evaluation

| Mode | When | What |
| --- | --- | --- |
| Offline (batch) | Pre-deployment / CI | Run test set through pipeline, compute metrics, compare to baseline |
| Online (production) | Post-deployment | Sample live traffic, score with LLM judge, track dashboards |

Offline evaluation checklist:

- Pin all versions (embedding model, retriever config, LLM, prompt template)
- Log raw inputs/outputs alongside metrics for debugging
- Set a minimum acceptable threshold per metric (e.g., Faithfulness ≥ 0.85)
- Fail CI if any metric regresses beyond a tolerance (e.g., ≥ 5% relative drop)

Online sampling:

- Don’t score 100% of traffic; sample 5–10% to control LLM judge cost
- Use stratified sampling to cover rare query types
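
One way to combine a low sampling rate with coverage of rare query types is to sample per stratum and enforce a minimum count. This is a sketch under assumptions: the `type` field on each logged request and the `min_per_stratum` parameter are hypothetical, and how you assign query types is up to your own logging.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def stratified_sample(
    records: List[dict],
    key: Callable[[dict], Hashable],
    rate: float = 0.05,
    min_per_stratum: int = 1,
    seed: int = 0,
) -> List[dict]:
    """Sample roughly `rate` of each stratum, but at least `min_per_stratum`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata: Dict[Hashable, List[dict]] = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample: List[dict] = []
    for items in strata.values():
        n = max(min_per_stratum, round(len(items) * rate))
        n = min(n, len(items))  # never ask for more than the stratum holds
        sample.extend(rng.sample(items, n))
    return sample


# Toy traffic log: 200 common queries and 10 rare ones.
records = [{"id": i, "type": "faq"} for i in range(200)]
records += [{"id": 200 + i, "type": "legal"} for i in range(10)]

sample = stratified_sample(records, key=lambda r: r["type"], rate=0.05)
print(len(sample), sum(1 for r in sample if r["type"] == "legal"))
```

A plain 5% random sample of this log would often contain no `legal` queries at all; the per-stratum minimum guarantees at least one is scored.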


7. LLM-as-Judge

Using a strong LLM (e.g., GPT-4o) to judge outputs is now common practice because it scales better than human annotation and tends to correlate well with human judgment.

Prompt pattern:

You are an expert evaluator. Given the question, retrieved context, and answer below,
rate the faithfulness of the answer on a scale of 1–5.

Question: {question}
Context: {context}
Answer: {answer}

Faithfulness score (1=completely unfaithful, 5=fully supported): 

Pitfalls:

- The judge LLM can be biased toward longer or more confident-sounding answers
- The same judge can return different scores on re-runs, so average over 3 runs
- Use a different model as judge than as generator to avoid self-preference bias
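
Averaging over several runs can be sketched as follows. The `call_judge` argument is a hypothetical callable that sends a prompt to whichever judge model you use and returns its text reply; it is stubbed here with canned replies so the example runs offline. The score parsing is a simple assumption and may need tightening for your judge's output format.

```python
import re
import statistics
from typing import Callable, Optional

PROMPT_TEMPLATE = (
    "You are an expert evaluator. Given the question, retrieved context, and answer below,\n"
    "rate the faithfulness of the answer on a scale of 1–5.\n\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer: {answer}\n\n"
    "Faithfulness score (1=completely unfaithful, 5=fully supported): "
)


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the reply, if any."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_faithfulness(
    call_judge: Callable[[str], str],
    question: str,
    context: str,
    answer: str,
    runs: int = 3,
) -> Optional[float]:
    """Ask the judge `runs` times and average the scores that could be parsed."""
    prompt = PROMPT_TEMPLATE.format(question=question, context=context, answer=answer)
    scores = []
    for _ in range(runs):
        score = parse_score(call_judge(prompt))
        if score is not None:
            scores.append(score)
    return statistics.mean(scores) if scores else None


# Stub judge with canned replies, standing in for a real model call.
canned_replies = iter(["4", "Score: 5", "4 - mostly supported"])
average = judge_faithfulness(
    lambda prompt: next(canned_replies),
    question="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris.",
    answer="Paris is the capital of France.",
)
print(average)
```

Returning `None` when no reply can be parsed keeps malformed judge output from silently counting as a low score.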


8. Regression Tracking

Treat evaluation like a software test suite: every pipeline change should be validated against the baseline before shipping.

Workflow:

1. Commit pipeline change (new embedding model, chunk size, prompt, etc.)
2. Run eval suite on pinned test set
3. Compare metric deltas to baseline
4. If any metric drops > threshold → block merge
5. If all metrics pass → update baseline, merge

Tools:

- MLflow / W&B: log eval runs, compare experiments, track metric history
- RAGAS in CI: run a RAGAS evaluation script in GitHub Actions
- LangSmith / Phoenix: built-in dataset + eval runner with regression dashboards

Baseline storage:

{
  "version": "2024-11-01",
  "faithfulness": 0.91,
  "answer_relevancy": 0.87,
  "context_recall": 0.79
}
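
A regression gate over a baseline like this can be a few lines of Python. The sketch below assumes the baseline is loaded from JSON in the shape shown above; the `floors` and `rel_tolerance` parameters, and the scores for the current run, are hypothetical values chosen to illustrate a failing check.

```python
import json
from typing import Dict, List

BASELINE_JSON = """
{
  "version": "2024-11-01",
  "faithfulness": 0.91,
  "answer_relevancy": 0.87,
  "context_recall": 0.79
}
"""


def find_regressions(
    baseline: Dict[str, object],
    current: Dict[str, float],
    floors: Dict[str, float],
    rel_tolerance: float = 0.05,
) -> List[str]:
    """Return one message per metric that is missing, below its floor,
    or has dropped by more than `rel_tolerance` relative to the baseline."""
    failures: List[str] = []
    for metric, base in baseline.items():
        if not isinstance(base, (int, float)):
            continue  # skip non-metric fields such as "version"
        if metric not in current:
            failures.append(f"{metric}: missing from current run")
            continue
        score = current[metric]
        if metric in floors and score < floors[metric]:
            failures.append(f"{metric}: {score:.2f} is below floor {floors[metric]:.2f}")
        elif base > 0 and (base - score) / base > rel_tolerance:
            failures.append(f"{metric}: dropped from {base:.2f} to {score:.2f}")
    return failures


baseline = json.loads(BASELINE_JSON)
current = {"faithfulness": 0.90, "answer_relevancy": 0.80, "context_recall": 0.80}

failures = find_regressions(baseline, current, floors={"faithfulness": 0.85})
for message in failures:
    print("REGRESSION", message)
```

In CI, the calling script would exit with a non-zero status when `failures` is non-empty, which blocks the merge; the baseline file is only rewritten after a passing run.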

9. End-to-End vs. Component Evaluation

| Level | What to Measure | When to Use |
| --- | --- | --- |
| Component: Retriever | Recall@K, Precision@K | Optimising chunk size, embedding model, top-K |
| Component: Reranker | NDCG, precision delta vs. no reranker | Evaluating reranker effectiveness |
| Component: Generator | Faithfulness, Answer Relevance | Optimising prompts, models |
| End-to-end | Answer Correctness, user satisfaction | Validating overall pipeline |

Tip: Start with end-to-end. If overall quality is poor, drill into components to isolate the bottleneck.


Summary

| Metric | Scope | Requires ground truth? |
| --- | --- | --- |
| Context Relevance | Retriever | No (LLM judge) |
| Faithfulness | Generator | No (LLM judge) |
| Answer Relevance | Generator | No (LLM judge) |
| Context Recall | Retriever | Yes |
| Answer Correctness | End-to-end | Yes |

The no-reference metrics (Faithfulness, Answer Relevance, Context Relevance) are your day-to-day production monitors. The reference-based metrics are for periodic deep-dives with curated test sets.
