Generation in RAG
1. Prompt Construction
The core task is assembling a prompt that includes the retrieved documents alongside the user’s question.
Basic structure:
System: You are a helpful assistant. Answer using only the provided context.
Context:
[Chunk 1]
[Chunk 2]
[Chunk 3]
Question: {user_query}
Answer:
Key design decisions: - Where to place context (before vs. after the question) - How to delimit chunks (XML tags, numbered sections, separators) - Whether to include chunk metadata (source, date, score) - Whether to instruct the model to say “I don’t know” if context is insufficient
2. Context Window Management
LLMs have a fixed context window. With many retrieved chunks, you can exceed it.
Strategies: - Truncation — cut off chunks that exceed the limit (lossy) - Contextual compression — use an LLM or extractor to shorten each chunk to only the relevant sentences before injecting - Map-reduce — answer each chunk independently, then synthesize the answers - Refine — iteratively refine an answer by feeding one chunk at a time
3. The Lost-in-the-Middle Problem
Research shows LLMs perform worse when relevant information is in the middle of a long context — they attend better to the beginning and end.
Mitigations: - Put the most relevant chunk first or last - Use re-ranking to surface the best chunk to a prominent position - Reduce the number of chunks passed to the prompt
4. Answer Grounding
Ensuring the generated answer actually comes from the retrieved context, not the model’s parametric memory.
Techniques: - Explicit instruction: “Only use the provided documents. Do not use prior knowledge.” - Ask the model to quote or cite the source passage - Post-generation verification (check if claims exist in context) - Faithfulness scoring with RAGAS or TruLens
5. Citation and Attribution
Telling the user where the answer came from builds trust and allows verification.
Approaches: - Inline citations: The policy was updated in 2024 [Source: doc3.pdf, p.12] - Footnotes: numbered references appended at the end - Source cards: return source metadata alongside the answer - Ask the LLM to output structured JSON with answer + sources fields
# Prompt snippet for citations
"After your answer, list the document IDs you used as: Sources: [id1, id2]"6. Handling Insufficient Context
When retrieved chunks don’t contain the answer:
- Abstention: instruct the model to say “I don’t have enough information”
- Fallback to parametric knowledge: allow the model to use its own knowledge but flag it
- Confidence signaling: ask the model to rate its own confidence
- Re-retrieval: trigger another retrieval pass with a rewritten query
7. Structured vs. Free-form Output
RAG generation doesn’t always produce prose — sometimes you need structured output.
| Output Type | Use Case |
|---|---|
| Free-form text | Q&A, summarization |
| JSON | API responses, data extraction |
| Table | Comparison queries |
| Step-by-step list | How-to / procedural queries |
Modern LLMs support function calling / structured output (e.g., OpenAI’s response_format=json_schema) to enforce output shape.
8. Streaming
For UX responsiveness, the generated answer can be streamed token-by-token rather than waiting for the full response.
Most frameworks (LangChain, LlamaIndex) support .stream() or async generators:
for chunk in chain.stream({"question": query}):
print(chunk, end="", flush=True)9. Multi-turn / Conversational RAG
In chat applications, generation must account for conversation history.
Challenges: - Prior turns contain context that affects what to retrieve - The full history can overflow the context window
Solutions: - Condense question: use the LLM to rewrite the latest question as a standalone query before retrieval - Memory summarization: compress old turns into a summary - Message windowing: only keep the last N turns
Standalone question rewrite prompt:
"Given the chat history and latest question, rewrite the question
as a standalone question that can be understood without the history."
10. Prompt Injection Risks
Retrieved documents may contain adversarial text that attempts to hijack the LLM’s behavior (e.g., a document containing “Ignore all previous instructions…”).
Mitigations: - Sanitize or escape retrieved content before injecting - Use XML/delimiters to clearly separate context from instructions - Apply input/output guardrails (e.g., LlamaGuard, Nemo Guardrails)
Summary Flow
User Query
│
▼
[Retriever] → Top-K Chunks
│
▼
[Context Window Manager] → Compressed / Ordered Chunks
│
▼
[Prompt Builder] → System + Context + Question
│
▼
[LLM] → Raw Answer
│
▼
[Post-processor] → Citations, Structured Output, Faithfulness Check
│
▼
Final Answer → User