Observability in RAG
1. What to Log
Every RAG request is a chain of operations. Observability requires logging the full chain together so you can replay and debug any failure.
Minimum log record per request:

    {
      "request_id": "uuid",
      "timestamp": "2024-11-01T12:00:00Z",
      "query": "What are the refund policies?",
      "query_embedding_ms": 18,
      "retrieved_chunks": [
        {"id": "doc42_chunk3", "score": 0.91, "text": "..."},
        {"id": "doc17_chunk1", "score": 0.88, "text": "..."}
      ],
      "retrieval_ms": 42,
      "reranker_ms": 120,
      "prompt_tokens": 1240,
      "completion_tokens": 180,
      "generation_ms": 890,
      "answer": "Our refund policy allows returns within 30 days...",
      "total_latency_ms": 1070
    }

Never log PII without masking. Strip or hash user identifiers before writing to observability stores.
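The masking rule can be applied in a small sanitizing step just before a record is written. The following is a minimal sketch under stated assumptions: the `user_id` field, the `salt` parameter, and the email-only regular expression are hypothetical and are not part of the record schema shown here. A real deployment would likely need broader PII detection than a single pattern.

```python
import hashlib
import json
import re

# Hypothetical pattern: catches only simple email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_identifier(value: str, salt: str) -> str:
    """Return a salted SHA-256 digest so the raw identifier is never stored."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_text(text: str) -> str:
    """Replace email addresses in free text with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def sanitize_record(record: dict, salt: str) -> dict:
    """Return a masked copy of a log record; the input is left unchanged."""
    clean = dict(record)
    if "user_id" in clean:  # assumed field name, not in the example record
        clean["user_id"] = hash_identifier(str(clean["user_id"]), salt)
    for field in ("query", "answer"):
        if field in clean:
            clean[field] = mask_text(clean[field])
    # Chunk text can also carry PII copied from source documents.
    clean["retrieved_chunks"] = [
        {**chunk, "text": mask_text(chunk.get("text", ""))}
        for chunk in clean.get("retrieved_chunks", [])
    ]
    return clean


record = {
    "request_id": "uuid",
    "user_id": "customer-1234",
    "query": "Refund status for jane.doe@example.com?",
    "retrieved_chunks": [{"id": "doc42_chunk3", "score": 0.91, "text": "..."}],
    "answer": "Our refund policy allows returns within 30 days...",
}
sanitized = sanitize_record(record, salt="rotate-me")
print(json.dumps(sanitized, indent=2))
```

Hashing rather than deleting the identifier keeps records from the same user joinable during debugging without exposing who the user is.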
2. Tracing the RAG Chain
A single user request spans multiple services (embedding API, vector DB, LLM). Distributed tracing ties them together with a shared trace_id.
OpenTelemetry integration pattern:

    from opentelemetry import trace

    # user_query, embed, index, llm, and build_prompt are supplied by the application.
    tracer = trace.get_tracer("rag-pipeline")

    # Parent span for the whole request; each step runs in a nested child span.
    with tracer.start_as_current_span("rag_request") as span:
        span.set_attribute("query", user_query)  # mask PII before recording
        with tracer.start_as_current_span("embed_query"):
            query_vector = embed(user_query)
        with tracer.start_as_current_span("vector_search"):
            chunks = index.search(query_vector, top_k=5)
        with tracer.start_as_current_span("generate"):
            answer = llm.generate(build_prompt(user_query, chunks))

Tools that provide RAG-aware tracing out of the box:
- LangSmith (LangChain): full chain visualization
- Arize Phoenix: open-source, LLM-first observability
- Weave (Weights & Biases): trace + eval combined
- Helicone / Braintrust: LLM call logging with replay
3. Latency Breakdown
For production SLAs you need to know where time is spent, not just total latency.
| Step | Typical Latency | Notes |
|---|---|---|
| Query embedding | 10–30 ms | Can be cached for repeated queries |
| Vector search | 5–50 ms | Depends on index size and ANN params |
| Re-ranking | 50–300 ms | Cross-encoder is expensive |
| LLM generation | 500–3000 ms | Largest contributor; streaming lowers perceived latency, not total time |
| Total | ~800–3500 ms | Target < 2 s for interactive UX |
A per-step p50 / p95 / p99 breakdown is more informative than the mean: a slow tail at p99 often signals retrieval timeouts or cold LLM instances.
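To make the percentile point concrete, the sketch below computes p50 / p95 / p99 per latency field from a batch of log records. It assumes the field names from the log record in section 1 and uses the nearest-rank percentile definition. The sample data is synthetic, and a metrics backend would normally do this aggregation for you.

```python
import math
from collections import defaultdict

# Latency fields taken from the example log record in section 1.
STEP_FIELDS = (
    "query_embedding_ms",
    "retrieval_ms",
    "reranker_ms",
    "generation_ms",
    "total_latency_ms",
)


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_breakdown(records):
    """Map each latency field present in the records to its p50 / p95 / p99."""
    samples = defaultdict(list)
    for record in records:
        for field in STEP_FIELDS:
            if field in record:
                samples[field].append(record[field])
    return {
        field: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for field, vals in samples.items()
    }


# Synthetic data: 98 normal requests and 2 slow generations.
records = [{"generation_ms": 900, "retrieval_ms": 40} for _ in range(98)]
records += [{"generation_ms": 9000, "retrieval_ms": 45} for _ in range(2)]

breakdown = latency_breakdown(records)
print(breakdown["generation_ms"])  # {'p50': 900, 'p95': 900, 'p99': 9000}
```

In this synthetic batch the mean generation latency is 1062 ms, which looks healthy, while p99 exposes the 9-second tail.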
4. Quality Monitoring in Production
Beyond latency, you need signals that answer quality is not degrading.
Metric streams to track:
- Inline LLM judge score: sample 5–10% of requests; score Faithfulness + Answer Relevance; alert on rolling average drop
- User feedback signals: thumbs up/down, copy-paste rate, follow-up question rate (indirect quality signal)
- Retrieval zero-hit rate: fraction of queries where top-1 chunk score < threshold (signals corpus gaps)
- Token usage: track input/output tokens per request to catch prompt bloat regressions
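Three of these streams can be computed with very little code. The sketch below is illustrative only: `should_sample`, `zero_hit_rate`, and `RollingMean` are hypothetical helper names, the rolling window is count-based rather than time-based, and treating an empty retrieval result as a zero-hit is an assumption.

```python
import hashlib
from collections import deque


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for LLM-judge scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Read the first 8 bytes as a 64-bit integer and compare against rate * 2**64.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def zero_hit_rate(records, threshold: float) -> float:
    """Fraction of requests whose best chunk score is below `threshold`."""
    if not records:
        return 0.0
    misses = 0
    for record in records:
        scores = [c["score"] for c in record.get("retrieved_chunks", [])]
        # Assumption: a request with no chunks at all counts as a zero-hit.
        if not scores or max(scores) < threshold:
            misses += 1
    return misses / len(records)


class RollingMean:
    """Mean over the most recent `window` scores."""

    def __init__(self, window: int):
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> float:
        self.scores.append(score)
        return sum(self.scores) / len(self.scores)


records = [
    {"retrieved_chunks": [{"id": "doc42_chunk3", "score": 0.91}]},
    {"retrieved_chunks": [{"id": "doc17_chunk1", "score": 0.42}]},
    {"retrieved_chunks": []},
    {"retrieved_chunks": [{"id": "doc08_chunk2", "score": 0.88}]},
]
print(zero_hit_rate(records, threshold=0.5))  # 0.5

faithfulness = RollingMean(window=3)
for score in (0.9, 0.8, 0.4, 0.3):
    latest = faithfulness.add(score)
print(round(latest, 2))  # 0.5
```

Sampling on a hash of the request ID, rather than a random draw, means the same request is always either scored or skipped, which keeps replays and re-processing consistent.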
Dashboard example (Grafana / Datadog):
Panels:
- p50/p95 total latency (time series)
- Faithfulness score rolling 24h average (gauge)
- Zero-hit rate (% time series)
- Token usage per request (histogram)
- Error rate by step (retrieval, rerank, generation)
5. Debugging with Trace Replay
When a user reports a bad answer, you need to replay the exact request.
Requirements for replay:
1. Stored full trace (query, retrieved chunks at time of request, answer)
2. Pinned versions (embedding model, index snapshot, prompt version, LLM version)
3. A replay tool that re-runs generation given the stored chunks (not re-retrieval)
Important: Re-running retrieval won't reproduce the failure if the index has changed since the original request. Always log the chunk content, not just the chunk IDs.
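A replay tool along these lines can be quite small. The sketch below re-runs generation from the chunk text stored in the trace and never touches the index. `PROMPT_TEMPLATE`, `replay`, and `fake_generate` are hypothetical names, and the `generate` callable stands in for a call to the pinned LLM version.

```python
from typing import Callable

# Hypothetical template; a real replay would load the pinned prompt version.
PROMPT_TEMPLATE = (
    "Answer the question using only the context.\n\n"
    "Context:\n{context}\n\nQuestion: {query}\n"
)


def build_prompt(query: str, chunks: list) -> str:
    """Rebuild the prompt from chunk text stored in the trace, in logged order."""
    context = "\n---\n".join(chunk["text"] for chunk in chunks)
    return PROMPT_TEMPLATE.format(context=context, query=query)


def replay(trace: dict, generate: Callable[[str], str]) -> dict:
    """Re-run generation only, using stored chunks instead of re-retrieving."""
    missing = [c["id"] for c in trace["retrieved_chunks"] if not c.get("text")]
    if missing:
        # Chunk IDs alone are not enough: the index may have changed since.
        raise ValueError(f"trace lacks chunk text for: {missing}")
    prompt = build_prompt(trace["query"], trace["retrieved_chunks"])
    replayed = generate(prompt)
    return {
        "request_id": trace["request_id"],
        "original_answer": trace["answer"],
        "replayed_answer": replayed,
        "matches_original": replayed == trace["answer"],
    }


stored_trace = {
    "request_id": "uuid",
    "query": "What are the refund policies?",
    "retrieved_chunks": [
        {"id": "doc42_chunk3", "score": 0.91,
         "text": "Returns are accepted within 30 days."},
    ],
    "answer": "Our refund policy allows returns within 30 days...",
}


def fake_generate(prompt: str) -> str:
    """Stand-in for the pinned LLM call so the sketch runs offline."""
    return "Returns are accepted within 30 days."


result = replay(stored_trace, fake_generate)
print(result["matches_original"])  # False
```

Exact string equality is a crude comparison because LLM output is usually non-deterministic. In practice the replayed and original answers would more likely be compared by a reviewer or a judge model.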
Root-cause checklist:
Bad answer received
├── Was the relevant chunk retrieved? → NO → Retrieval problem (embedding, top-K, filtering)
├── Was the relevant chunk in top-K but not used? → Context window / ordering problem
├── Chunk was used but answer is wrong → Generation / prompt problem
└── Chunk was wrong → Corpus quality / chunking problem
6. Alerting
Set alerts on the metrics most likely to surface real problems early.
| Alert | Condition | Severity |
|---|---|---|
| High error rate | Error rate > 1% over 5 min | Critical |
| Latency spike | p95 > 5 s for 10 min | Warning |
| Faithfulness drop | Rolling 1h average < 0.75 | Warning |
| Zero-hit surge | Zero-hit rate > 15% over 30 min | Warning |
| Token budget overrun | Mean prompt tokens > 90% of context window | Info |
Avoid alert fatigue: Only alert on actionable signals. Log everything else for post-hoc analysis.
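The alert table can be expressed as data and evaluated against a metrics snapshot. This is a sketch, not a replacement for a real alerting system: the metric key names are hypothetical, and the time windows and "sustained for N minutes" conditions are assumed to be applied upstream, before the snapshot is built.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    metric: str  # key in the metrics snapshot (assumed naming)
    breached: Callable[[float], bool]


# Thresholds copied from the alert table; rates are expressed as fractions.
RULES = [
    AlertRule("High error rate", "Critical", "error_rate_5m", lambda v: v > 0.01),
    AlertRule("Latency spike", "Warning", "p95_latency_s_10m", lambda v: v > 5.0),
    AlertRule("Faithfulness drop", "Warning", "faithfulness_avg_1h", lambda v: v < 0.75),
    AlertRule("Zero-hit surge", "Warning", "zero_hit_rate_30m", lambda v: v > 0.15),
    AlertRule("Token budget overrun", "Info", "prompt_tokens_window_frac", lambda v: v > 0.90),
]


def evaluate(metrics: dict) -> list:
    """Return (severity, name) for every rule whose metric is present and breached."""
    fired = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        if value is not None and rule.breached(value):
            fired.append((rule.severity, rule.name))
    return fired


snapshot = {
    "error_rate_5m": 0.004,
    "p95_latency_s_10m": 6.2,
    "faithfulness_avg_1h": 0.71,
    "zero_hit_rate_30m": 0.08,
    "prompt_tokens_window_frac": 0.55,
}
print(evaluate(snapshot))
# [('Warning', 'Latency spike'), ('Warning', 'Faithfulness drop')]
```

Keeping the rules in one list makes it easy to review which signals page someone and to drop any that turn out not to be actionable.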
Summary
Every RAG request → structured log record (query + chunks + answer + latencies)
→ trace spans (embed, search, rerank, generate)
→ sampled quality score (LLM judge)
→ aggregated dashboards + alerts
The goal is that any production failure is diagnosable within minutes using logs alone, without needing to reproduce it from scratch.