Caching in RAG
1. Why Cache in RAG?
RAG pipelines have three costly steps per request, listed here in pipeline order:

1. Embedding: encode the query via an embedding API (~10–30 ms, billable tokens)
2. Retrieval: vector search (~5–50 ms, often the cheapest step)
3. Generation: call the LLM (~500–3000 ms, most expensive)
For many applications, a significant fraction of queries are repeated or semantically very similar (FAQ-type questions, common lookups). Caching those saves real money and latency.
Typical cache hit rates by domain:

- Internal knowledge-base chatbot: 30–60% (many repeated questions)
- Customer support: 20–40%
- General-purpose assistant: 5–15%
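To see what a given hit rate buys, a rough expected-value estimate helps. The sketch below assumes illustrative per-step latencies picked from the ranges in section 1 and a hypothetical 5 ms cache lookup; none of these numbers are measurements.

```python
def expected_latency_ms(hit_rate: float, cache_ms: float, pipeline_ms: float) -> float:
    """Average latency when a fraction `hit_rate` of requests is served from cache.

    A miss pays for the failed cache lookup plus the full pipeline.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ms + (1.0 - hit_rate) * (cache_ms + pipeline_ms)


# Illustrative numbers: 20 ms embedding + 30 ms retrieval + 1500 ms generation.
PIPELINE_MS = 20 + 30 + 1500
CACHE_MS = 5  # assumed lookup cost, not a benchmark

for label, hit_rate in [("knowledge base", 0.45), ("support", 0.30), ("general", 0.10)]:
    avg = expected_latency_ms(hit_rate, CACHE_MS, PIPELINE_MS)
    print(f"{label}: {avg:.0f} ms average vs {PIPELINE_MS} ms uncached")
```

Because generation dominates the total, average latency falls almost in proportion to the hit rate.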
2. Query-Level Caching (Answer Cache)
Cache the full answer for a query so the entire RAG pipeline is bypassed on a hit.
Exact-match cache:
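```python
import hashlib

import redis

# Shared client; connection settings are deployment-specific.
redis_client = redis.Redis()

def rag_with_cache(query: str, tenant_id: str) -> str:
    # Scope the key by tenant so answers never leak across tenants.
    cache_key = hashlib.sha256(f"{tenant_id}:{query}".encode()).hexdigest()
    cached = redis_client.get(cache_key)
    if cached is not None:
        return cached.decode()
    # Cache miss: run the full pipeline (defined elsewhere) and store the answer.
    answer = full_rag_pipeline(query, tenant_id)
    redis_client.setex(cache_key, 3600, answer)  # TTL: 1 hour
    return answer
```

Limitations of exact-match:

- “What is our return policy?” and “What’s the return policy?” produce different keys, so the hit rate is low
- Solution: semantic cache (see section 3)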
When to use exact-match: high-traffic, predictable queries (pre-generated FAQ responses, report summaries).
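Part of the lost hit rate can be recovered cheaply by normalizing the query before hashing. The sketch below shows one possible normalization (Unicode folding, lowercasing, whitespace collapsing, trailing punctuation stripping). `normalize_query` and `answer_cache_key` are hypothetical helper names, and rewordings such as "What's" versus "What is" still produce different keys.

```python
import hashlib
import re
import unicodedata


def normalize_query(query: str) -> str:
    """Fold trivial surface differences so equivalent strings hash identically."""
    text = unicodedata.normalize("NFKC", query)
    text = text.casefold().strip()
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    text = text.rstrip(" ?!.")          # ignore trailing punctuation
    return text


def answer_cache_key(tenant_id: str, query: str) -> str:
    """Tenant-scoped cache key built from the normalized query."""
    raw = f"{tenant_id}:{normalize_query(query)}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

How aggressive to make the normalization is a judgment call: every extra rule raises the hit rate slightly but also raises the chance that two genuinely different questions collapse to the same key.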
3. Semantic Cache
A semantic cache matches queries by meaning (embedding similarity) rather than exact text, dramatically increasing hit rate.
How it works:

1. Embed the incoming query
2. Search a “cache index” of previously answered queries
3. If cosine similarity > threshold (e.g. 0.95), return the cached answer
4. Otherwise, run the full pipeline and store the result in the cache index
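```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_cache_lookup(query: str, threshold: float = 0.95):
    # cache_index is a vector index of previously answered queries (defined
    # elsewhere); its scores are assumed to be cosine similarities.
    query_vec = encoder.encode(query)
    results = cache_index.search(query_vec, top_k=1)
    if results and results[0].score >= threshold:
        return results[0].metadata["cached_answer"]
    return None  # miss: the caller runs the full pipeline and stores the result
```

Threshold tuning:

- High threshold (0.97+): conservative, fewer false hits, lower hit rate
- Low threshold (0.90): aggressive, more hits but risk of returning wrong answers

Similarity scores are not comparable across embedding models, so the threshold should be re-tuned whenever the encoder changes.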
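One way to choose the threshold is to sweep candidate values over a small labeled set of query pairs and inspect the trade-off directly. The sketch below assumes `(similarity, same_answer)` pairs have been collected offline; the sample data is invented purely for illustration.

```python
from typing import Iterable


def sweep_thresholds(
    pairs: list[tuple[float, bool]],
    thresholds: Iterable[float],
) -> list[dict]:
    """For each threshold, report hit rate and the share of hits that are wrong.

    `pairs` holds (cosine_similarity, same_answer) for labeled query pairs,
    where same_answer is True when the cached answer would have been correct.
    """
    rows = []
    for t in thresholds:
        hits = [ok for score, ok in pairs if score >= t]
        false_hits = sum(1 for ok in hits if not ok)
        rows.append({
            "threshold": t,
            "hit_rate": len(hits) / len(pairs) if pairs else 0.0,
            "false_hit_rate": false_hits / len(hits) if hits else 0.0,
        })
    return rows


# Invented labeled pairs, for illustration only.
labeled = [(0.99, True), (0.97, True), (0.96, True), (0.94, False),
           (0.93, True), (0.91, False), (0.88, False), (0.80, False)]

for row in sweep_thresholds(labeled, [0.90, 0.95, 0.97]):
    print(row)
```

A reasonable rule is to pick the lowest threshold whose false-hit rate is still acceptable for the application.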
Tools: GPTCache, LangChain's semantic cache integrations (for example RedisSemanticCache), custom Faiss/Qdrant cache index.
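For small caches, the lookup and store steps fit in a few lines without a vector database. The following is a minimal in-memory sketch, not a production design: `SemanticCache` is a hypothetical class, the linear scan costs O(n) per lookup, and `embed` is whatever encoder is already in use (a toy bag-of-words embedding stands in for it in the usage lines).

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan semantic cache: fine for hundreds of entries, not millions."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []  # (query vector, answer)

    def get(self, query: str) -> Optional[str]:
        if not self.entries:
            return None
        vec = self.embed(query)
        best_vec, best_answer = max(self.entries, key=lambda e: cosine(vec, e[0]))
        return best_answer if cosine(vec, best_vec) >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))


# Toy stand-in for a real encoder: counts of a fixed vocabulary.
VOCAB = ["return", "policy", "shipping", "refund"]

def toy_embed(text: str) -> list[float]:
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("What is our return policy?", "<cached answer about returns>")
print(cache.get("What's the return policy?"))   # similar wording
print(cache.get("How much is shipping?"))       # unrelated topic
```

In a multi-tenant system each tenant would need its own `SemanticCache` instance (or a tenant filter on the index) so that similar questions from different tenants never share an answer.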
4. Embedding Cache
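Caching query embeddings avoids calling the embedding API for repeated queries. Because the key is derived from the exact text, only identical strings hit; normalize queries first if near-identical ones should share an entry.
Simple in-process embedding cache (size-bounded LRU, no TTL):
```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_query_cached(query: str) -> tuple[float, ...]:
    # embedding_api is the embedding client (defined elsewhere).
    # Return a tuple so callers cannot mutate the cached value.
    return tuple(embedding_api.embed(query))
```
For multi-process / distributed systems, use Redis:
```python
import hashlib
import json

import redis

redis_client = redis.Redis()

def embed_with_redis_cache(query: str) -> list[float]:
    key = f"embed:{hashlib.sha256(query.encode()).hexdigest()}"
    cached = redis_client.get(key)
    if cached is not None:
        return json.loads(cached)
    vec = embedding_api.embed(query)  # assumed to return a list of floats
    redis_client.setex(key, 86400, json.dumps(vec))  # 24h TTL
    return vec
```
Cost impact: At $0.02 per million tokens, embedding queries is cheap, but each call adds ~20 ms of latency. For high-frequency queries, a cache hit replaces the API call with an in-process or Redis lookup, which removes most of that latency.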
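The cost side can be checked with back-of-the-envelope arithmetic. The sketch below uses the $0.02 per million tokens figure from above; the query volume and tokens-per-query values are assumptions chosen for illustration.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, figure quoted above


def daily_embedding_cost(queries_per_day: int, tokens_per_query: int,
                         hit_rate: float = 0.0) -> float:
    """USD per day spent embedding queries, given an embedding-cache hit rate."""
    billable_queries = queries_per_day * (1.0 - hit_rate)
    tokens = billable_queries * tokens_per_query
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


# Assumed workload: 1M queries/day, about 20 tokens each.
uncached = daily_embedding_cost(1_000_000, 20)
cached = daily_embedding_cost(1_000_000, 20, hit_rate=0.5)
print(f"uncached: ${uncached:.2f}/day, with 50% hits: ${cached:.2f}/day")
```

Under these assumptions the saving is a matter of cents per day, which is why latency rather than cost is usually the argument for caching query embeddings.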
5. Document / Chunk Cache
Cache the processed representation of documents so re-indexing doesn’t re-embed unchanged content.
Ingestion-time chunk cache:
```python
import hashlib

THIRTY_DAYS = 30 * 24 * 3600  # TTL in seconds

def ingest_with_chunk_cache(doc_id: str, content: str):
    # chunk_embed_cache, vector_index, chunk and embed_batch are defined elsewhere.
    checksum = hashlib.sha256(content.encode()).hexdigest()
    # Check if embeddings already exist for this exact content
    cached_embeddings = chunk_embed_cache.get(checksum)
    if cached_embeddings is not None:
        # Re-use stored embeddings, just upsert with updated metadata
        vector_index.upsert(cached_embeddings)
        return
    # Compute from scratch
    chunks = chunk(content)
    embeddings = embed_batch(chunks)
    chunk_embed_cache.set(checksum, embeddings, ttl=THIRTY_DAYS)
    vector_index.upsert(embeddings)
```

Benefit: When a document is re-ingested with unchanged content (e.g., only its metadata changed), re-embedding is skipped. This saves significant cost for large corpora.
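A finer-grained variant keys the cache on each chunk rather than on the whole document, so an edit to one section only re-embeds the chunks that changed. This is a sketch under assumptions: `embed_batch` is passed in as a callable, the cache is a plain dict, and the fixed-size splitter is a placeholder for a real chunker.

```python
import hashlib
from typing import Callable


def split_fixed(content: str, size: int = 200) -> list[str]:
    """Placeholder chunker: fixed-size character windows."""
    return [content[i:i + size] for i in range(0, len(content), size)]


def embed_chunks_cached(
    content: str,
    cache: dict[str, list[float]],
    embed_batch: Callable[[list[str]], list[list[float]]],
) -> list[list[float]]:
    """Return one vector per chunk, embedding only chunks not seen before."""
    chunks = split_fixed(content)
    keys = [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks]

    # A dict deduplicates repeated chunks so each missing one is embedded once.
    missing = {k: c for k, c in zip(keys, chunks) if k not in cache}
    if missing:
        vectors = embed_batch(list(missing.values()))
        cache.update(zip(missing.keys(), vectors))

    return [cache[k] for k in keys]
```

This pays off most with chunkers that split on stable boundaries such as headings or paragraphs; with fixed-size windows, an insertion near the top of a document shifts every later chunk and defeats the cache.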
6. Cache Invalidation
Caches must be invalidated when the underlying data changes.
Invalidation triggers per layer:
| Cache Layer | Invalidation Trigger |
|---|---|
| Answer cache | Source document updated → delete cached answers for affected queries |
| Semantic cache | Hard to invalidate selectively → use short TTL (minutes to hours) |
| Embedding cache | Embedding model changes → flush; otherwise entries stay valid (embeddings are deterministic for a fixed model) and a TTL only bounds memory |
| Chunk embed cache | Content checksum changes → cache miss → recompute |
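Deleting cached answers when a source document changes requires knowing which documents each answer was built from. One way to do this, sketched below with in-memory dicts standing in for Redis, is to record a reverse index from document ID to cache keys at write time. `AnswerCache` and its method names are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class AnswerCache:
    """Answer cache that remembers which source documents each answer used."""

    def __init__(self) -> None:
        self._answers: dict[str, str] = {}
        self._keys_by_doc: dict[str, set[str]] = defaultdict(set)

    def set(self, key: str, answer: str, source_doc_ids: list[str]) -> None:
        self._answers[key] = answer
        for doc_id in source_doc_ids:
            self._keys_by_doc[doc_id].add(key)

    def get(self, key: str) -> Optional[str]:
        return self._answers.get(key)

    def invalidate_document(self, doc_id: str) -> int:
        """Drop every cached answer that cited `doc_id`; return how many."""
        keys = self._keys_by_doc.pop(doc_id, set())
        removed = 0
        for key in keys:
            if self._answers.pop(key, None) is not None:
                removed += 1
        return removed
```

This only covers answers that cited the changed document. A newly added document that would now be retrieved for an old query is not caught, so a TTL is still needed as a backstop.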
Practical TTL defaults:
| Layer | Recommended TTL |
|---|---|
| Answer (exact) | 1–24 hours (depends on doc update frequency) |
| Answer (semantic) | 15–60 minutes |
| Query embedding | 24 hours |
| Chunk embedding | 30 days |
Access control and caching: Never share cached answers across tenants. Cache keys must include tenant_id (and user role if fine-grained ACLs apply).
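A small key-builder makes the scoping rule harder to forget. The sketch below is one possible layout; the function name, the field names, and the idea of folding model and index versions into the key are assumptions, not a standard.

```python
import hashlib
import json


def scoped_cache_key(
    tenant_id: str,
    query: str,
    role: str = "",
    model_version: str = "",
    index_version: str = "",
) -> str:
    """Build a cache key that cannot collide across tenants or roles.

    JSON-encoding the parts avoids the ambiguity of plain string
    concatenation (for example "a:b" + "c" versus "a" + "b:c").
    """
    parts = [tenant_id, role, model_version, index_version, query]
    payload = json.dumps(parts, ensure_ascii=False, separators=(",", ":"))
    return "answer:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including version fields also gives a cheap bulk invalidation: bumping `index_version` after a re-index makes every old entry unreachable, and the TTL then cleans them up.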
7. Caching and Streaming
Answer caching needs extra care with streaming responses: the complete answer only exists once the stream has finished, so there is nothing to cache mid-flight.
Solutions:

- Buffer then cache: collect the full stream into a string, cache it, then return it (lose streaming UX on first call but serve cached responses instantly)
- Cache at the retrieval level only: cache the retrieved chunks, still stream the generation
- Async populate: stream the response to the user, write to cache in the background after the stream completes
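Example of async populate:
```python
import asyncio

_background_tasks = set()  # keep references so tasks are not garbage-collected

async def rag_stream_with_cache(query):
    # cache and llm_stream are defined elsewhere.
    cached = await cache.get(query)
    if cached:
        yield cached  # Return full cached answer as one chunk
        return
    full_answer = []
    async for token in llm_stream(query):
        full_answer.append(token)
        yield token
    # Cache asynchronously after streaming completes
    task = asyncio.create_task(cache.set(query, "".join(full_answer)))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
```

Summary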
| Cache Layer | What Is Cached | Saves | Key Challenge |
|---|---|---|---|
| Answer (exact) | Full answer keyed by query hash | LLM call + retrieval | Low hit rate |
| Answer (semantic) | Full answer keyed by similar query | LLM call + retrieval | Threshold tuning |
| Embedding | Query vector | Embedding API call | Flush on embedding model change |
| Chunk embedding | Document chunk vectors | Re-embedding on ingest | Invalidation on content change |
Most impactful: Semantic answer cache for high-traffic FAQ-like RAG systems. Start there before optimising other layers.