Observability in RAG

Observability means capturing every step of a RAG request — query, retrieved chunks, generation — in a way that lets you debug failures, track latency, and monitor quality in production.

Author

Benedict Thekkel

1. What to Log

Every RAG request is a chain of operations. Observability requires logging the full chain together so you can replay and debug any failure.

Minimum log record per request:

{
  "request_id": "uuid",
  "timestamp": "2024-11-01T12:00:00Z",
  "query": "What are the refund policies?",
  "query_embedding_ms": 18,
  "retrieved_chunks": [
    {"id": "doc42_chunk3", "score": 0.91, "text": "..."},
    {"id": "doc17_chunk1", "score": 0.88, "text": "..."}
  ],
  "retrieval_ms": 42,
  "reranker_ms": 120,
  "prompt_tokens": 1240,
  "completion_tokens": 180,
  "generation_ms": 890,
  "answer": "Our refund policy allows returns within 30 days...",
  "total_latency_ms": 1070
}

Never log PII without masking. Strip or hash user identifiers before writing to observability stores.
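A minimal sketch of how such a record might be assembled with masking applied before it is written. The helper names (`hash_user_id`, `mask_emails`, `build_log_record`), the `user_hash` field, and the `HASH_KEY` secret are illustrative assumptions, not part of any particular logging library. The email regex is only one example of a PII pattern, and the total is computed as a plain sum, which assumes the steps run sequentially.

```python
import hashlib
import hmac
import json
import re
import uuid
from datetime import datetime, timezone

# Hypothetical secret for keyed hashing; load it from a secret store in practice.
HASH_KEY = b"replace-me"

# Illustrative pattern only: real PII masking needs more than email detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_user_id(user_id: str) -> str:
    """Keyed SHA-256 digest: records stay joinable per user without the raw ID."""
    return hmac.new(HASH_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def build_log_record(user_id, query, chunks, answer, timings_ms):
    """Assemble one JSON-serializable record per request."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_hash": hash_user_id(user_id),
        "query": mask_emails(query),
        "retrieved_chunks": chunks,
        "answer": mask_emails(answer),
        **timings_ms,
        # Assumes sequential steps, so the total is the sum of the parts.
        "total_latency_ms": sum(timings_ms.values()),
    }


record = build_log_record(
    user_id="user-123",
    query="Refund for jane.doe@example.com?",
    chunks=[{"id": "doc42_chunk3", "score": 0.91, "text": "..."}],
    answer="Our refund policy allows returns within 30 days...",
    timings_ms={
        "query_embedding_ms": 18,
        "retrieval_ms": 42,
        "reranker_ms": 120,
        "generation_ms": 890,
    },
)
print(json.dumps(record, indent=2))
```

A keyed hash is used here rather than a bare hash because short or guessable identifiers can otherwise be recovered by brute force.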

2. Tracing the RAG Chain

A single user request spans multiple services (embedding API, vector DB, LLM). Distributed tracing ties them together with a shared trace_id.

OpenTelemetry integration pattern:

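from opentelemetry import trace

# embed, index, llm, build_prompt, and user_query are assumed to be
# defined elsewhere in the application.
tracer = trace.get_tracer("rag-pipeline")

# Parent span for the whole request; the child spans below share its trace ID.
with tracer.start_as_current_span("rag_request") as span:
    # Mask PII before attaching the query to the span.
    span.set_attribute("query", user_query)

    with tracer.start_as_current_span("embed_query"):
        query_vector = embed(user_query)

    with tracer.start_as_current_span("vector_search"):
        chunks = index.search(query_vector, top_k=5)

    with tracer.start_as_current_span("generate"):
        answer = llm.generate(build_prompt(user_query, chunks))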

Tools that provide RAG-aware tracing out of the box:

- LangSmith (LangChain): full chain visualization
- Arize Phoenix: open-source, LLM-first observability
- Weave (Weights & Biases): trace + eval combined
- Helicone / Braintrust: LLM call logging with replay

3. Latency Breakdown

For production SLAs you need to know where time is spent, not just total latency.

Step	Typical Latency	Notes
Query embedding	10–30 ms	Near zero on a cache hit for a repeated query
Vector search	5–50 ms	Depends on index size and ANN params
Re-ranking	50–300 ms	Cross-encoder is expensive
LLM generation	500–3000 ms	Largest contributor; streaming lowers perceived latency (time to first token), not total time
Total	~570–3400 ms	Sum of the rows above; target < 2 s for interactive UX
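One way to collect the per-step numbers in this table is a small timing context manager wrapped around each stage. This is a sketch using only the standard library; the `timed` helper is a hypothetical name, and the `time.sleep` calls stand in for real retrieval and generation work.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(timings: dict, key: str):
    """Record the wall-clock duration of the wrapped block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so failed steps still show a latency.
        timings[key] = round((time.perf_counter() - start) * 1000, 1)


timings = {}

with timed(timings, "retrieval_ms"):
    time.sleep(0.01)  # stand-in for the vector search call

with timed(timings, "generation_ms"):
    time.sleep(0.02)  # stand-in for the LLM call

# The right-hand side is evaluated before the new key is added.
timings["total_latency_ms"] = round(sum(timings.values()), 1)
print(timings)
```

The resulting dictionary can be merged directly into the per-request log record.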

p50 / p95 / p99 breakdown is more informative than mean — a slow tail at p99 often signals retrieval timeouts or cold LLM instances.
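To illustrate why the mean can hide a slow tail, here is a short sketch using `statistics.quantiles` from the Python standard library. The sample data is made up for the example: 98 requests at 900 ms and 2 at 9000 ms.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 for a list of latency samples (at least 2)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up sample: mostly fast requests plus two very slow ones.
samples = [900] * 98 + [9000] * 2

print("mean:", statistics.mean(samples))
print(latency_percentiles(samples))
```

In this sample the mean stays close to the typical request while p99 lands on the slow requests, which is the signal the mean smooths away.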

4. Quality Monitoring in Production

Beyond latency, you need signals that answer quality is not degrading.

Metric streams to track:

- Inline LLM judge score: sample 5–10% of requests; score Faithfulness + Answer Relevance; alert on rolling average drop
- User feedback signals: thumbs up/down, copy-paste rate, follow-up question rate (indirect quality signal)
- Retrieval zero-hit rate: fraction of queries where top-1 chunk score < threshold (signals corpus gaps)
- Token usage: track input/output tokens per request to catch prompt bloat regressions
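Two of these streams are simple to compute from the logged records. The sketch below shows one possible approach: deterministic sampling for the LLM judge (hashing the request ID so the same request is always in or out of the sample) and a zero-hit rate over a batch of records. The function names and the default `threshold` of 0.5 are assumptions for illustration; the right threshold depends on your embedding model and scoring scale.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for LLM-judge scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def zero_hit_rate(records, threshold: float = 0.5) -> float:
    """Fraction of requests whose best chunk score is below `threshold`.

    Requests that retrieved no chunks at all also count as zero-hit.
    """
    if not records:
        return 0.0
    misses = 0
    for rec in records:
        scores = [chunk["score"] for chunk in rec["retrieved_chunks"]]
        if not scores or max(scores) < threshold:
            misses += 1
    return misses / len(records)


records = [
    {"retrieved_chunks": [{"id": "doc42_chunk3", "score": 0.91}]},
    {"retrieved_chunks": [{"id": "doc17_chunk1", "score": 0.88}]},
    {"retrieved_chunks": [{"id": "doc03_chunk9", "score": 0.31}]},
    {"retrieved_chunks": [{"id": "doc08_chunk2", "score": 0.77}]},
]
print(zero_hit_rate(records))
```

Hash-based sampling keeps the decision reproducible across services, which helps when the judge runs asynchronously from the request path.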

Dashboard example (Grafana / Datadog):

Panels:
  - p50/p95 total latency (time series)
  - Faithfulness score rolling 24h average (gauge)
  - Zero-hit rate (% time series)
  - Token usage per request (histogram)
  - Error rate by step (retrieval, rerank, generation)

5. Debugging with Trace Replay

When a user reports a bad answer, you need to replay the exact request.

Requirements for replay:

1. Stored full trace (query, retrieved chunks at time of request, answer)
2. Pinned versions (embedding model, index snapshot, prompt version, LLM version)
3. A replay tool that re-runs generation given the stored chunks (not re-retrieval)

Important: Re-running retrieval against an index that has changed since the request will not reproduce the failure. Always log the chunk content, not just the chunk IDs.
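A replay tool along these lines can be quite small. The sketch below rebuilds the prompt from the stored chunks and re-runs generation only. Both `build_prompt` and `replay` are hypothetical helpers, the prompt template is invented for the example, and `generate` is whatever callable wraps your pinned LLM version (a stub is used here so the example runs on its own).

```python
from typing import Callable


def build_prompt(query: str, chunks: list) -> str:
    """Hypothetical prompt template; use the pinned prompt version in practice."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def replay(trace: dict, generate: Callable[[str], str]) -> dict:
    """Re-run generation from the stored chunks, without touching the index."""
    prompt = build_prompt(trace["query"], trace["retrieved_chunks"])
    replayed = generate(prompt)
    return {
        "request_id": trace["request_id"],
        "prompt": prompt,
        "original_answer": trace["answer"],
        "replayed_answer": replayed,
        "matches_original": replayed == trace["answer"],
    }


stored_trace = {
    "request_id": "r1",
    "query": "What are the refund policies?",
    "retrieved_chunks": [
        {"id": "doc42_chunk3", "score": 0.91, "text": "Returns accepted within 30 days."}
    ],
    "answer": "Our refund policy allows returns within 30 days.",
}

# Stub generator standing in for the real LLM call.
result = replay(stored_trace, lambda prompt: "Our refund policy allows returns within 30 days.")
print(result["matches_original"])
```

With a real model and non-zero temperature, an exact string match is unlikely, so comparing the replayed answer by eye or with a judge is usually more practical than `matches_original`.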

Root-cause checklist:

Bad answer received
├── Relevant chunk was not retrieved → Retrieval problem (embedding, top-K, filtering)
├── Relevant chunk was in top-K but not used → Context window / ordering problem
├── Relevant chunk was used but answer is wrong → Generation / prompt problem
└── Retrieved chunk content was itself wrong → Corpus quality / chunking problem
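The checklist above can be encoded as a small triage function once a reviewer has identified which chunk should have answered the question. This is only a sketch: the function name, its inputs, and the category labels are assumptions, and "used" is approximated here as "made it into the final prompt", which will not catch cases where the model saw the chunk but ignored it.

```python
def triage(relevant_id, retrieved_ids, prompt_ids, chunk_text_is_correct):
    """Map a reviewed bad answer to the most likely failing stage.

    relevant_id: ID of the chunk a reviewer says should answer the query.
    retrieved_ids: chunk IDs returned by retrieval (from the stored trace).
    prompt_ids: chunk IDs that were actually placed in the prompt.
    chunk_text_is_correct: reviewer judgement on the chunk's content.
    """
    if relevant_id not in retrieved_ids:
        return "retrieval"
    if relevant_id not in prompt_ids:
        return "context_window_or_ordering"
    if not chunk_text_is_correct:
        return "corpus_or_chunking"
    return "generation_or_prompt"


print(triage("doc42_chunk3", ["doc17_chunk1"], ["doc17_chunk1"], True))
```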

6. Alerting

Set alerts on the metrics most likely to surface real problems early.

Alert	Condition	Severity
High error rate	Error rate > 1% over 5 min	Critical
Latency spike	p95 > 5 s for 10 min	Warning
Faithfulness drop	Rolling 1h average < 0.75	Warning
Zero-hit surge	Zero-hit rate > 15% over 30 min	Warning
Token budget overrun	Mean prompt tokens > 90% of context window	Info

Avoid alert fatigue: Only alert on actionable signals. Log everything else for post-hoc analysis.
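In practice these rules would live in your monitoring system (Grafana, Datadog, or similar), but the thresholds from the table can be sketched as plain data to make them reviewable and testable. The metric key names below are hypothetical, and the example assumes the windowed aggregates (5 min, 10 min, and so on) have already been computed upstream.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str  # hypothetical key of a pre-aggregated metric
    breached: Callable[[float], bool]
    severity: str


RULES = [
    AlertRule("High error rate", "error_rate_5m", lambda v: v > 0.01, "critical"),
    AlertRule("Latency spike", "p95_latency_s_10m", lambda v: v > 5.0, "warning"),
    AlertRule("Faithfulness drop", "faithfulness_avg_1h", lambda v: v < 0.75, "warning"),
    AlertRule("Zero-hit surge", "zero_hit_rate_30m", lambda v: v > 0.15, "warning"),
    AlertRule("Token budget overrun", "prompt_tokens_fraction_of_window", lambda v: v > 0.90, "info"),
]


def evaluate(metrics: dict) -> list:
    """Return (severity, name) for every rule whose condition is met."""
    fired = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        # Missing metrics are skipped rather than treated as breaches.
        if value is not None and rule.breached(value):
            fired.append((rule.severity, rule.name))
    return fired


current = {
    "error_rate_5m": 0.02,
    "p95_latency_s_10m": 1.8,
    "faithfulness_avg_1h": 0.70,
    "zero_hit_rate_30m": 0.05,
}
print(evaluate(current))
```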

Summary

Every RAG request → structured log record (query + chunks + answer + latencies)
                 → trace spans (embed, search, rerank, generate)
                 → sampled quality score (LLM judge)
                 → aggregated dashboards + alerts

The goal is that any production failure is diagnosable within minutes using logs alone, without needing to reproduce it from scratch.