# Chunking for RAG

## Core tradeoff
| Small chunks | Large chunks |
|---|---|
| Precise retrieval | More context per chunk |
| Less noise passed to LLM | Fewer chunks to manage |
| May lose surrounding context | May retrieve irrelevant content |
| Higher chunk count, more storage | Lower recall on specific facts |
Sweet spot in practice: 256–512 tokens with overlap.
## Strategies

### 1. Fixed-size with overlap
Split every N tokens, slide forward by N-overlap. Simple, predictable, works well as a baseline.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,  # swap for tiktoken for token-accurate splits
)
chunks = splitter.split_text(document_text)
```

Token-accurate version:
```python
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text):
    return len(enc.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=token_len,
)
```

### 2. Recursive character splitting
Tries to split on `\n\n`, then `\n`, then sentence boundaries (`". "`), then spaces, then individual characters — respects natural text boundaries. This is the recommended default.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.create_documents(
    texts=[document_text],
    metadatas=[{"source": "my_doc.pdf", "page": 1}],
)
```

### 3. Semantic chunking
Embed every sentence, split where embedding similarity drops — i.e. where the topic changes. Higher quality, slower to build index.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,
)
chunks = splitter.split_text(document_text)
```

### 4. Document-structure-aware splitting
Respect headings, sections, tables. Best for structured docs (Markdown, HTML, PDFs with structure).
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)
# Each chunk carries {"h1": "...", "h2": "..."} metadata automatically
```

For HTML:
```python
from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(headers_to_split_on=[
    ("h1", "header_1"),
    ("h2", "header_2"),
])
chunks = splitter.split_text(html_content)
```

### 5. Hierarchical / parent-child chunking
Index small child chunks for precise retrieval, but pass the larger parent chunk to the LLM for context. Best of both worlds.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Small chunks go into the vector index
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Large chunks are what gets returned to the LLM
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
store = InMemoryStore()  # swap for Redis/DB in prod

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)

# Retrieves small chunks but returns parent context
results = retriever.invoke("your query")
```

### 6. Agentic / late chunking
Embed the full document with a long-context model, then derive each chunk's vector by pooling the token embeddings within that chunk's span, rather than embedding each chunk in isolation. Preserves full-document context in every chunk's vector. Requires models that support it (e.g. jina-embeddings-v3).
Not widely available yet — worth watching but not standard prod practice.
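The pooling step can be sketched without a specific model. Everything below is illustrative stand-in data — the random vectors substitute for per-token output from a long-context embedding model, and the function names are hypothetical, not from any library:

```python
# Late-chunking pooling step, sketched in plain Python: one pass of
# full-document token embeddings, then a mean-pool per chunk span.
import random

def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def late_chunk(token_vecs: list[list[float]], spans: list[tuple[int, int]]) -> list[list[float]]:
    """One chunk vector per (start, end) token span, pooled from shared context."""
    return [mean_pool(token_vecs[s:e]) for s, e in spans]

# 100 tokens of 8-dim stand-in embeddings, three overlapping chunk spans
token_vecs = [[random.random() for _ in range(8)] for _ in range(100)]
chunk_vecs = late_chunk(token_vecs, [(0, 40), (30, 70), (60, 100)])
```

Because every token vector was produced in one forward pass over the whole document, each pooled chunk vector carries context from outside its own span — the property that per-chunk embedding loses.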
## Overlap — why it matters
Without overlap, a sentence split across a boundary loses context in both halves. Overlap of 10–15% of chunk size covers this.
```python
# 512-token chunks, ~50-token overlap = ~10% overlap.
# Good default. Don't go above 20% — wasteful and degrades precision.
chunk_size = 512
chunk_overlap = 50
```

## Chunk size guidelines by use case
| Use case | Recommended size |
|---|---|
| Q&A over dense docs | 256–512 tokens |
| Summarisation | 512–1024 tokens |
| Code | Full function / class (structure-aware) |
| Conversational / FAQ | 128–256 tokens |
| Long-form reports | Hierarchical (200 child / 1000 parent) |
## What to store per chunk
```json
{
  "id": "uuid",
  "document_id": "parent_doc_uuid",
  "chunk_index": 3,
  "content": "...",
  "embedding": [...],
  "content_hash": "sha256_of_content",
  "chunk_strategy": "recursive_512_50",
  "embedding_model": "text-embedding-3-small",
  "source_path": "uploads/report.pdf",
  "page_number": 2,
  "section_heading": "Results",
  "token_count": 487,
  "created_at": "..."
}
```

## Production recommendations
- Use recursive character splitting as your default. It handles most text types well without the overhead of semantic splitting.
- Add semantic chunking for high-value, unstructured long documents where topic shifts matter — clinical notes, research papers, legal docs.
- Use hierarchical chunking when retrieval precision is high but LLM answers feel incomplete or out of context.
- Always store `chunk_strategy` and `embedding_model`: changing either means re-indexing everything, and these fields are how you know which chunks are stale.
- Re-index on document update, not on a schedule. Use `content_hash` to detect actual changes and skip unchanged chunks.
```python
import hashlib

def should_reindex(existing_hash: str, new_content: str) -> bool:
    new_hash = hashlib.sha256(new_content.encode()).hexdigest()
    return existing_hash != new_hash
```

Validate chunk quality before writing to the index — filter empty chunks, chunks under ~20 tokens, and chunks that are pure whitespace/boilerplate.
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def is_valid_chunk(text: str, min_tokens: int = 20) -> bool:
    tokens = enc.encode(text)
    # bool() so we return False for whitespace-only text, not the string itself
    return bool(text.strip()) and len(tokens) >= min_tokens

chunks = [c for c in raw_chunks if is_valid_chunk(c)]
```
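Put together, a minimal indexing pass looks like the sketch below. It is stdlib-only and deliberately simplified: word-based splitting stands in for token-based splitting, the 20-word quality gate mirrors the ~20-token rule above, and the actual vector-store write is omitted since it depends on your stack:

```python
# End-to-end sketch: fixed-size splitting, quality gate, hashing, metadata.
import hashlib
import uuid

def split_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Word-based sliding window; step forward by size - overlap each time."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

def index_document(text: str, source_path: str) -> list[dict]:
    records = []
    for i, chunk in enumerate(split_fixed(text)):
        if not chunk.strip() or len(chunk.split()) < 20:
            continue  # quality gate: skip empty/tiny chunks
        records.append({
            "id": str(uuid.uuid4()),
            "chunk_index": i,
            "content": chunk,
            "content_hash": hashlib.sha256(chunk.encode()).hexdigest(),
            "chunk_strategy": "fixed_512_50",
            "source_path": source_path,
        })
    return records
```

In production the records would carry the full schema from "What to store per chunk", and each would be upserted to the vector store only when `should_reindex` reports a changed hash.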