# Query Transformation for RAG

Raw user queries are rarely good retrieval queries. Transform them before searching:

```text
"what's the recovery thing after knee op?"
    → transform →
["post-operative knee rehabilitation protocol",
 "ACL surgery recovery timeline",
 "knee replacement physiotherapy exercises"]
```
## Strategies
| Strategy | What it does | Best for |
|---|---|---|
| Query expansion | Generate multiple phrasings | Ambiguous / underspecified queries |
| HyDE | Generate hypothetical answer, embed that | Short queries, question→answer domain gap |
| Step-back | Rephrase to more general question | Narrow queries missing broader context |
| Query decomposition | Split into sub-questions | Complex multi-hop queries |
| Query rewriting | Clean and clarify the raw query | Noisy / conversational input |
| Routing | Classify query to correct index/source | Multi-index or multi-tenant setups |
## Local LLM setup (Ollama)

All examples below use Ollama with open-source models. Install once, use across all strategies.
```bash
# install + pull a model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral      # good default
ollama pull llama3.2     # stronger, larger
ollama pull qwen2.5:3b   # fast + lightweight
```

```python
# llm.py — shared client
import json

import httpx

OLLAMA_BASE = "http://localhost:11434"

def generate(prompt: str, model: str = "mistral") -> str:
    resp = httpx.post(
        f"{OLLAMA_BASE}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

def generate_json(prompt: str, model: str = "mistral") -> dict | list:
    """Force JSON output — works reliably with mistral/llama3."""
    resp = httpx.post(
        f"{OLLAMA_BASE}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "format": "json",  # Ollama native JSON mode
        },
        timeout=60.0,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])
```

## 1. Query expansion
Generate N alternative phrasings, retrieve for each, and fuse the result lists with reciprocal rank fusion (RRF). The best general-purpose transformation.
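The retrieval snippets on this page import `reciprocal_rank_fusion` from a `.rrf` module but never define it. A minimal sketch of the assumed implementation (`k = 60` is the conventional constant from the original RRF paper):

```python
def reciprocal_rank_fusion(
    result_lists: list[list[dict]],
    k: int = 60,
) -> list[dict]:
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    by_id: dict[str, dict] = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item["id"]] = scores.get(item["id"], 0.0) + 1.0 / (k + rank)
            by_id.setdefault(item["id"], item)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [by_id[doc_id] for doc_id in ranked]
```

Documents that appear in several lists accumulate score, so agreement between query variants is rewarded even when the absolute ranks differ.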
```python
def expand_query(query: str, n: int = 3) -> list[str]:
    result = generate_json(f"""Generate {n} alternative search queries for the question below.
Each should approach the question from a different angle or use different terminology.
Return a JSON object with key "queries" containing a list of strings.
Do not include the original question.

Question: {query}""")
    variants = result.get("queries", [])
    return [query] + [v for v in variants if v.strip()]

# Example output:
# expand_query("recovery time after ACL surgery")
# → [
#     "recovery time after ACL surgery",
#     "ACL reconstruction rehabilitation duration",
#     "how long does ACL surgery recovery take",
#     "post-operative ACL physiotherapy timeline",
# ]
```

Usage in retrieval:
```python
from .retrieval import dense_retrieve
from .rrf import reciprocal_rank_fusion

def multi_query_retrieve(
    query: str,
    tenant_id: str,
    top_k: int = 10,
) -> list[dict]:
    queries = expand_query(query, n=3)
    all_lists = []
    for q in queries:
        results = dense_retrieve(q, tenant_id, top_k=top_k)
        all_lists.append([
            {"id": str(c.id), "content": c.content}
            for c in results
        ])
    fused = reciprocal_rank_fusion(all_lists)
    return fused[:top_k]
```

## 2. HyDE (Hypothetical Document Embeddings)
Generate a fake answer, embed it, retrieve on the answer vector. Works because real documents live in “answer space” — closer to a fake answer than a raw question.
```text
"what is ACL recovery time?"
    → generate fake passage about ACL recovery
    → embed fake passage
    → retrieve real documents near that embedding
```
```python
from pgvector.django import CosineDistance

from .embeddings import embed
from .models import DocumentChunk

def generate_hypothetical_doc(query: str) -> str:
    return generate(f"""Write a short factual passage (3-5 sentences) that directly
answers the following question. Write only the passage itself — no preamble,
no "here is" or "this passage" — just the content.

Question: {query}""")

def hyde_retrieve(
    query: str,
    tenant_id: str,
    top_k: int = 10,
) -> list[dict]:
    hypo_doc = generate_hypothetical_doc(query)
    # Embed as a document — no instruction prefix
    hypo_vec = embed([hypo_doc])[0]
    results = list(
        DocumentChunk.objects
        .filter(tenant_id=tenant_id, index_status="indexed")
        .annotate(distance=CosineDistance("embedding", hypo_vec))
        .order_by("distance")[:top_k]
    )
    return [{"id": str(c.id), "content": c.content} for c in results]
```

When HyDE helps most:

- Short, vague queries where the question form is very different from document form
- FAQ or knowledge-base style docs where answers are declarative passages
When HyDE hurts:

- Factual queries where the hypothetical doc hallucinates wrong details
- The fake doc confidently embeds in the wrong direction
Hedge by combining with standard dense retrieval:
```python
def hyde_plus_dense_retrieve(
    query: str,
    tenant_id: str,
    top_k: int = 10,
) -> list[dict]:
    hyde_results = hyde_retrieve(query, tenant_id, top_k=top_k)
    dense_results = dense_retrieve(query, tenant_id, top_k=top_k)
    dense_dicts = [{"id": str(c.id), "content": c.content} for c in dense_results]
    fused = reciprocal_rank_fusion([hyde_results, dense_dicts])
    return fused[:top_k]
```

## 3. Step-back prompting
Rephrase a narrow or specific query to a broader question that captures the context needed to answer it. Then retrieve on both the original and step-back query.
```text
"what exercises should I do 3 weeks post ACL surgery?"
    → step back →
"what is the standard ACL surgery rehabilitation protocol?"
```
```python
def step_back(query: str) -> str:
    return generate(f"""Rewrite the following question as a broader, more general question
that would help find background knowledge needed to answer it.
Return only the rewritten question, nothing else.

Original question: {query}""")

def step_back_retrieve(
    query: str,
    tenant_id: str,
    top_k: int = 10,
) -> list[dict]:
    broader_query = step_back(query)
    original_results = dense_retrieve(query, tenant_id, top_k=top_k)
    broader_results = dense_retrieve(broader_query, tenant_id, top_k=top_k)
    original_dicts = [{"id": str(c.id), "content": c.content} for c in original_results]
    broader_dicts = [{"id": str(c.id), "content": c.content} for c in broader_results]
    fused = reciprocal_rank_fusion([original_dicts, broader_dicts])
    return fused[:top_k]
```

## 4. Query decomposition
Break a complex multi-part query into atomic sub-questions. Retrieve for each independently, merge results. Essential for multi-hop reasoning.
```text
"what are the differences between ACL and MCL recovery
 and which has better outcomes with physio?"
    → decompose →
[
  "ACL surgery recovery timeline and protocol",
  "MCL injury recovery timeline and protocol",
  "physiotherapy outcomes for ACL vs MCL injuries",
]
```
```python
def decompose_query(query: str) -> list[str]:
    result = generate_json(f"""Break the following question into simpler, independent
sub-questions that can each be answered separately.
Return a JSON object with key "sub_questions" containing a list of strings.
If the question is already simple, return it as a single item list.

Question: {query}""")
    return result.get("sub_questions", [query])

def decomposed_retrieve(
    query: str,
    tenant_id: str,
    top_k: int = 10,
) -> list[dict]:
    sub_questions = decompose_query(query)
    all_lists = []
    for sub_q in sub_questions:
        results = dense_retrieve(sub_q, tenant_id, top_k=top_k)
        all_lists.append([
            {"id": str(c.id), "content": c.content}
            for c in results
        ])
    fused = reciprocal_rank_fusion(all_lists)
    return fused[:top_k]
```

## 5. Query rewriting
Clean conversational or noisy input into a crisp retrieval query. Essential when input comes from a chat UI where users type casually.
```text
"uhh yeah like i was asking about the knee thing,
 what did you say about how long it takes again?"
    → rewrite →
"knee surgery recovery duration"
```
```python
def rewrite_query(
    query: str,
    history: list[dict] | None = None,  # [{"role": "user"|"assistant", "content": ...}]
) -> str:
    history_str = ""
    if history:
        history_str = "\n".join(
            f"{m['role'].upper()}: {m['content']}"
            for m in history[-4:]  # last 2 turns
        )
        history_str = f"Conversation history:\n{history_str}\n\n"
    return generate(f"""{history_str}Rewrite the following user query into a clear,
concise search query suitable for a document retrieval system.
Remove filler words, resolve pronouns using conversation history if provided,
and use precise terminology.
Return only the rewritten query, nothing else.

User query: {query}""")
```

## 6. Query routing
Classify the query and route to the appropriate index, collection, or retrieval strategy. Critical for multi-tenant or multi-source setups.
```python
from dataclasses import dataclass
from typing import Literal

# retrieve() and RetrievalConfig are assumed to live alongside dense_retrieve
from .retrieval import RetrievalConfig, retrieve

@dataclass
class RouteDecision:
    index: str  # which index/collection to search
    strategy: Literal["dense", "hybrid", "bm25"]
    reason: str

AVAILABLE_INDEXES = {
    "clinical_notes": "Patient clinical notes and assessments",
    "questionnaires": "Patient questionnaire responses and scores",
    "knowledge_base": "General healthcare and rehabilitation knowledge",
    "appointment_notes": "Appointment summaries and practitioner notes",
}

def route_query(query: str) -> RouteDecision:
    index_descriptions = "\n".join(
        f"- {k}: {v}" for k, v in AVAILABLE_INDEXES.items()
    )
    result = generate_json(f"""Given the following query and available indexes,
decide which index to search and which retrieval strategy to use.

Available indexes:
{index_descriptions}

Retrieval strategies:
- dense: semantic similarity search, best for conceptual questions
- hybrid: dense + keyword search, best for most queries
- bm25: keyword only, best for exact terms, IDs, codes

Return a JSON object with keys:
- "index": one of the index names above
- "strategy": one of dense, hybrid, bm25
- "reason": brief explanation

Query: {query}""")
    return RouteDecision(
        index=result.get("index", "knowledge_base"),
        strategy=result.get("strategy", "hybrid"),
        reason=result.get("reason", ""),
    )

def routed_retrieve(
    query: str,
    tenant_id: str,
    top_k: int = 10,
) -> list[dict]:
    decision = route_query(query)
    # Use decision.index to filter, decision.strategy to pick retrieval method
    filters = {"source_type": decision.index}
    config = RetrievalConfig(
        strategy=decision.strategy,
        top_k=top_k,
        filters=filters,
    )
    return retrieve(query, tenant_id, config)
```

## Full transformation pipeline
Compose strategies based on query characteristics:
```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class TransformConfig:
    rewrite: bool = True  # always clean input first
    strategy: Literal[
        "expand", "hyde", "step_back",
        "decompose", "direct"
    ] = "expand"  # default expansion
    use_routing: bool = False

def transform_and_retrieve(
    query: str,
    tenant_id: str,
    history: list[dict] | None = None,
    top_k: int = 10,
    config: TransformConfig | None = None,  # avoid a shared default instance
) -> list[dict]:
    config = config or TransformConfig()
    # Step 1 — always rewrite noisy input first
    clean_query = rewrite_query(query, history) if config.rewrite else query
    # Step 2 — optionally route to correct index
    if config.use_routing:
        return routed_retrieve(clean_query, tenant_id, top_k)
    # Step 3 — apply transformation strategy
    match config.strategy:
        case "expand":
            return multi_query_retrieve(clean_query, tenant_id, top_k)
        case "hyde":
            return hyde_plus_dense_retrieve(clean_query, tenant_id, top_k)
        case "step_back":
            return step_back_retrieve(clean_query, tenant_id, top_k)
        case "decompose":
            return decomposed_retrieve(clean_query, tenant_id, top_k)
        case "direct":
            results = dense_retrieve(clean_query, tenant_id, top_k)
            return [{"id": str(c.id), "content": c.content} for c in results]
```

## Strategy selection guide
| Query type | Recommended strategy |
|---|---|
| Short, clear factual question | HyDE or direct |
| Vague / conversational | Rewrite → expand |
| Multi-part / complex | Decompose |
| Narrow / overly specific | Step-back |
| Multi-index setup | Route first |
| General default | Rewrite → expand → hybrid retrieval |
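The table above can be approximated in code. The keyword checks and thresholds below are illustrative assumptions, not tuned values, and `TransformConfig` is redeclared so the snippet stands alone:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class TransformConfig:
    """Redeclared from the pipeline above so this snippet is self-contained."""
    rewrite: bool = True
    strategy: Literal["expand", "hyde", "step_back", "decompose", "direct"] = "expand"
    use_routing: bool = False

def choose_config(query: str, multi_index: bool = False) -> TransformConfig:
    """Heuristic version of the selection table; thresholds are illustrative."""
    q = query.lower()
    n_words = len(q.split())
    if multi_index:
        return TransformConfig(use_routing=True)            # route first
    if " and " in q or q.count("?") > 1:
        return TransformConfig(strategy="decompose")        # multi-part / complex
    if n_words <= 6 and q.strip().endswith("?"):
        return TransformConfig(rewrite=False, strategy="hyde")  # short, clear factual
    return TransformConfig(strategy="expand")               # general default
```

Step-back is omitted here because "narrow / overly specific" is hard to detect lexically; in practice it is usually chosen manually or by a small LLM classifier.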
## Latency cost per strategy
| Strategy | Extra LLM calls | Typical added latency |
|---|---|---|
| Rewrite | 1 | ~300ms |
| Expand (3 variants) | 1 | ~500ms + 3× retrieval |
| HyDE | 1 | ~500ms |
| Step-back | 1 | ~300ms + 2× retrieval |
| Decompose | 1 | ~500ms + N× retrieval |
| Route | 1 | ~300ms |
All use local Ollama — latency depends on your hardware. On GPU, add ~50–100ms per call. On CPU, multiply by 5–10×.
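Repeat queries can skip the LLM call entirely by caching transformations keyed on a hash of the query. A minimal sketch, where `transform` stands in for any of the functions above (e.g. `rewrite_query`):

```python
import hashlib
from collections.abc import Callable

_transform_cache: dict[str, str] = {}

def cached_transform(query: str, transform: Callable[[str], str]) -> str:
    """Cache a query transformation by SHA-256 of the normalized query."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _transform_cache:
        _transform_cache[key] = transform(query)
    return _transform_cache[key]
```

An unbounded dict is fine for a sketch; in production you would want an LRU bound or TTL (e.g. `functools.lru_cache`, or a Redis cache in multi-process deployments).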
## Common failure modes
| Problem | Fix |
|---|---|
| Expansion generates near-identical variants | Use higher temperature, stronger prompt instruction |
| HyDE hallucinates wrong facts into embedding | Combine with standard dense, don’t use HyDE alone |
| Decomposition produces too many sub-questions | Cap at 4–5, limit top_k per sub-question |
| Step-back too broad, retrieves irrelevant context | Fuse with original query results via RRF |
| Routing misclassifies query | Add few-shot examples to routing prompt |
| Rewrite loses important specifics | Include conversation history, prompt to preserve key terms |
| LLM call adds too much latency | Cache transformed queries by hash, use smaller/faster model |
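For the routing misclassification row, few-shot examples can be baked into the routing prompt before the query. A sketch; the example queries and labels below are invented for illustration:

```python
ROUTING_EXAMPLES = """Examples:
Query: "what did the physio note at my last appointment?"
→ {"index": "appointment_notes", "strategy": "hybrid", "reason": "refers to a specific appointment"}

Query: "questionnaire score for patient 4821"
→ {"index": "questionnaires", "strategy": "bm25", "reason": "exact ID lookup"}

Query: "how does ACL rehab usually progress?"
→ {"index": "knowledge_base", "strategy": "dense", "reason": "general conceptual question"}
"""

def routing_prompt(query: str, index_descriptions: str) -> str:
    """Build a routing prompt with few-shot examples inserted before the query."""
    return (
        "Given the following query and available indexes, decide which index "
        "to search and which retrieval strategy to use.\n\n"
        f"Available indexes:\n{index_descriptions}\n\n"
        f"{ROUTING_EXAMPLES}\n"
        f"Query: {query}"
    )
```

Three to five examples spanning the distinct indexes is usually enough; more mostly adds latency.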