Document Pre-processing for RAG
1. Why Pre-processing Matters
Garbage in, garbage out. The most sophisticated retrieval pipeline cannot recover from poorly extracted source text. Common failures:
| Root cause | Symptom at query time |
|---|---|
| Table rows merged into a single line | Numerical answers are garbled |
| PDF columns not separated | Sentences from two columns are fused |
| Scanned image with no OCR | Entire page is invisible to the retriever |
| Boilerplate headers/footers | Retrieved chunks polluted with irrelevant text |
| Missing section titles | Re-ranking has no structural signal |
Pre-processing is highly domain-specific — a pipeline that works for web pages may fail completely on engineering PDFs.
2. Document Formats and Parsers
Each format has different parsing challenges:
PDF
The hardest format. PDF is a layout specification, not a text format: extraction order can be wrong, columns may merge, and text inside embedded images is invisible to the parser.
Parser options:
```python
# pdfminer: low-level, good for plain-text PDFs
from pdfminer.high_level import extract_text

text = extract_text("doc.pdf")

# pymupdf (fitz): fast, handles layout better
import fitz

doc = fitz.open("doc.pdf")
pages = [page.get_text("text") for page in doc]

# unstructured: highest quality, handles mixed content
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("doc.pdf", strategy="hi_res")
```
Word / DOCX
```python
from docx import Document

doc = Document("report.docx")
paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
# One list of cell texts per row, flattened across all tables in the document
tables = [[cell.text for cell in row.cells] for table in doc.tables for row in table.rows]
```
HTML / Web pages
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html: the raw page source as a string
# Remove boilerplate
for tag in soup(["nav", "footer", "header", "script", "style"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
```
Markdown / plain text
Easiest — structure is already explicit. Use heading markers (#, ##) as natural chunk boundaries.
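Heading-based chunking can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a full Markdown parser: the function name `split_by_headings` and the output dictionary keys are assumptions made for this example, and only ATX-style headings (`#` to `######`) are recognised. Lines inside code fences are skipped so that a `#` comment in a code sample does not start a new chunk.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")
FENCE = "`" * 3  # code-fence marker, built this way so the example stays fence-safe


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks: list[dict] = []
    heading, level, lines = None, 0, []
    in_fence = False

    def flush() -> None:
        # Emit the chunk collected so far, unless it is completely empty
        body = "\n".join(lines).strip()
        if heading is not None or body:
            chunks.append({"heading": heading, "level": level, "text": body})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, level, lines = match.group(2).strip(), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Keeping the heading text and level alongside each chunk means they can later be stored as metadata or prepended to the chunk before embedding.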
3. Table Extraction
Tables are especially fragile — naive text extraction destroys their structure.
Strategy options:
- Convert to Markdown table — preserves row/column relationships; works well in context
- Convert to natural language — use an LLM to summarise: “Table showing Q1-Q4 revenue by region…”
- Store as structured data — query with Text-to-SQL at retrieval time (TableGPT approach)
- Hybrid — store both the Markdown and a text summary; embed the summary, retrieve the Markdown
```python
# Table Transformer (Microsoft): detects table regions in page images
from transformers import pipeline

table_extractor = pipeline("object-detection", model="microsoft/table-transformer-detection")

# camelot: extracts tables from PDFs with cell coordinates
import camelot

tables = camelot.read_pdf("report.pdf", pages="all", flavor="lattice")
df = tables[0].df  # pandas DataFrame
markdown = df.to_markdown(index=False)  # to_markdown needs the tabulate package
```
Rule of thumb: if a table is the primary source of the answer, convert it to Markdown and include it as its own chunk with a descriptive header.
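As a rough sketch of that rule of thumb, the helper below renders already-extracted rows as a Markdown table led by a descriptive title line, so the table can be indexed as a standalone chunk. The function name `table_to_markdown_chunk` and the sample title and values in the usage note are hypothetical; the code uses only the standard library and assumes the cells have already been extracted by one of the tools above.

```python
def table_to_markdown_chunk(title: str, header: list[str], rows: list[list[str]]) -> str:
    """Render a table as a Markdown chunk led by a descriptive title line."""

    def fmt(cells: list[str]) -> str:
        # Escape pipes and flatten newlines so each row stays on one line
        cleaned = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in cells]
        return "| " + " | ".join(cleaned) + " |"

    lines = [title, "", fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        # Pad or truncate ragged rows to the header width
        padded = (list(row) + [""] * len(header))[: len(header)]
        lines.append(fmt(padded))
    return "\n".join(lines)
```

For example, `table_to_markdown_chunk("Table: revenue by quarter (made-up values)", ["Quarter", "Revenue"], [["Q1", "10"], ["Q2", "12"]])` yields a chunk whose first line tells the retriever and the LLM what the table contains.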
4. OCR for Scanned Documents
Scanned PDFs and image files contain no machine-readable text — they must go through OCR before chunking.
```python
# Tesseract via pytesseract
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image, lang="eng")

# For PDFs: convert pages to images first
from pdf2image import convert_from_path

images = convert_from_path("scanned.pdf", dpi=300)
pages = [pytesseract.image_to_string(img) for img in images]

# unstructured handles this automatically with strategy="hi_res"
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("scanned.pdf", strategy="hi_res", languages=["eng"])
```
OCR quality tips:
- Use 300 DPI minimum for reliable character recognition
- Pre-process images: deskew, remove noise, increase contrast
- Validate output: check character-to-space ratios for garbage detection
- For critical documents, consider cloud OCR (AWS Textract, Google Document AI), which can be significantly more accurate than Tesseract
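The output-validation tip can be approximated with two cheap statistics. The sketch below is one possible heuristic, not a standard method: the function name `ocr_quality_report` and all thresholds (2 to 12 characters per space, at least 60% letters) are assumptions tuned for English-like prose and would need recalibrating for other languages or for table-heavy pages.

```python
def ocr_quality_report(text: str) -> dict:
    """Compute cheap statistics that tend to separate real text from OCR noise."""
    stripped = text.strip()
    if not stripped:
        return {"chars_per_space": 0.0, "alpha_ratio": 0.0, "suspicious": True}

    spaces = sum(c.isspace() for c in stripped)
    non_space = len(stripped) - spaces  # always >= 1 after strip()

    # English prose averages roughly five to six characters per space;
    # very high values suggest fused words, very low ones scattered fragments.
    chars_per_space = non_space / max(spaces, 1)
    # Share of non-space characters that are letters
    alpha_ratio = sum(c.isalpha() for c in stripped) / non_space

    # Assumed thresholds, for illustration only
    suspicious = not (2.0 <= chars_per_space <= 12.0) or alpha_ratio < 0.6
    return {
        "chars_per_space": chars_per_space,
        "alpha_ratio": alpha_ratio,
        "suspicious": suspicious,
    }
```

Pages flagged as suspicious can be re-run through OCR at higher DPI, routed to a stronger OCR service, or excluded from the index.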
5. Metadata Extraction
Metadata attached to chunks at index time enables precise filtering at query time (narrowing scope before vector search).
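To make the filtering step concrete, here is a minimal in-memory sketch of narrowing the candidate set before any vector search runs. Real vector stores expose their own filter syntax; the helper names `matches` and `prefilter`, the chunk layout (a dict with a `"metadata"` key), and the field names used in the usage note are illustrative assumptions.

```python
def matches(metadata: dict, filters: dict) -> bool:
    """True if the metadata satisfies every filter.

    A filter value may be a single value (equality) or a set/list/tuple
    of allowed values (membership).
    """
    for key, wanted in filters.items():
        value = metadata.get(key)
        if isinstance(wanted, (set, list, tuple)):
            if value not in wanted:
                return False
        elif value != wanted:
            return False
    return True


def prefilter(chunks: list[dict], filters: dict) -> list[dict]:
    """Keep only chunks whose metadata passes the filters.

    Similarity search would then run on the returned subset only.
    """
    return [c for c in chunks if matches(c["metadata"], filters)]
```

A call such as `prefilter(chunks, {"department": "finance", "doc_type": {"report", "policy"}})` restricts retrieval to finance reports and policies, so semantically similar but out-of-scope chunks never reach the ranking stage.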
Standard metadata fields:
```python
metadata = {
    "source": "annual_report_2024.pdf",
    "page": 12,
    "section": "Financial Results",
    "author": "Finance Team",
    "created_at": "2024-03-15",
    "doc_type": "report",  # report | policy | faq | email
    "department": "finance",
    "language": "en",
    "classification": "internal",  # public | internal | confidential
}
```
LLM-generated metadata: For unstructured documents, use an LLM to extract structured metadata:
```python
prompt = """
Extract the following from this document excerpt:
- topic (one of: policy, procedure, faq, technical, financial, legal)
- key_entities: list of company/product/person names mentioned
- date_references: any dates or time periods mentioned
Respond as JSON.
Document: {text}
"""
```
Hypothetical question metadata (Reverse HyDE): Generate the questions each chunk can answer and index them alongside the chunk. At query time, match the user query against these hypothetical questions rather than the raw chunk text; comparing a question with a question gives better semantic alignment than comparing a question with a passage.
6. Cleaning and Normalisation
Raw extracted text typically contains noise that degrades embedding quality.
```python
import re
import unicodedata


def clean_text(text: str) -> str:
    # Normalise Unicode (important for multilingual docs)
    text = unicodedata.normalize("NFKC", text)
    # Remove page numbers and headers/footers (regex-based heuristics).
    # These patterns are line-anchored, so they must run before newlines are collapsed.
    text = re.sub(r"^Page \d+ of \d+", "", text, flags=re.MULTILINE)
    text = re.sub(r"^CONFIDENTIAL\s*$", "", text, flags=re.MULTILINE)
    # Remove repeated punctuation (OCR artefacts)
    text = re.sub(r"[.]{3,}", "...", text)
    text = re.sub(r"[-]{3,}", "---", text)
    # Remove excessive whitespace (last, because it also removes line breaks)
    return re.sub(r"\s+", " ", text).strip()
```
Boilerplate removal strategies:
- Heuristic: lines that appear in >80% of documents are likely boilerplate
- LLM-based: prompt to identify and strip headers, footers, legal disclaimers
- Structural: for PDFs, detect repeated text at fixed page positions
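The frequency heuristic is straightforward to sketch with a document-frequency count. The function names `find_boilerplate_lines` and `strip_boilerplate` are assumptions for this example, and the approach presumes the text still has its line breaks (it has to run before any whitespace collapsing) and that the corpus is large enough for the 80% cut-off to be meaningful.

```python
from collections import Counter


def find_boilerplate_lines(documents: list[str], threshold: float = 0.8) -> set[str]:
    """Return lines that occur in more than `threshold` of the documents."""
    doc_freq: Counter = Counter()
    for doc in documents:
        # A set, so each distinct line counts once per document
        doc_freq.update({line.strip() for line in doc.splitlines() if line.strip()})
    cutoff = threshold * len(documents)
    return {line for line, count in doc_freq.items() if count > cutoff}


def strip_boilerplate(doc: str, boilerplate: set[str]) -> str:
    """Drop every line of `doc` that was identified as boilerplate."""
    return "\n".join(line for line in doc.splitlines() if line.strip() not in boilerplate)
```

Exact-match counting misses boilerplate that varies slightly between documents (dates, page numbers); normalising digits before counting is one common refinement.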
7. Pre-processing Pipeline Design
A production ingestion pipeline chains these stages:
```text
Raw file
  └─► Format detection
        └─► Parser (pdfminer / docx / html / ...)
              ├─► OCR (if scanned)
              ├─► Table extractor
              └─► Text extractor
                    └─► Cleaner / normaliser
                          └─► Metadata extractor
                                └─► Chunker → Embedder → Index
```
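The first stage, format detection, can be sketched by sniffing the leading bytes of a file and falling back to the extension. This is an illustrative assumption about how the stage might work, not a prescribed design: the function name `detect_format` and the returned labels are made up for this example, and production systems often rely on a dedicated MIME-detection library instead.

```python
from pathlib import Path

# Extension fallback for formats without a reliable byte signature
EXTENSION_MAP = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "text",
    ".html": "html",
    ".htm": "html",
}


def detect_format(path: str, head: bytes) -> str:
    """Guess a document format from its first bytes, then from its extension."""
    suffix = Path(path).suffix.lower()
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # DOCX is a ZIP container; the signature alone cannot tell it from other ZIPs
        return "docx" if suffix == ".docx" else "zip"
    sniff = head[:1024].lstrip().lower()
    if sniff.startswith((b"<!doctype html", b"<html")):
        return "html"
    return EXTENSION_MAP.get(suffix, "unknown")
```

The caller would read the first kilobyte or so of the file, pass it as `head`, and use the returned label to pick a parser; files reported as `unknown` are better logged and skipped than parsed with a guess.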
Key design decisions:
| Decision | Options |
|---|---|
| Parser choice | Speed vs quality — pymupdf (fast) vs unstructured (best quality) |
| Image handling | Skip images / OCR / multimodal embedding |
| Table handling | Markdown / natural language / structured store |
| Metadata source | File system attributes / LLM extraction / manual |
| Quality gate | Minimum chunk length, language detection, garbage detection |
Quality gate example:
```python
def is_valid_chunk(text: str, min_length: int = 50) -> bool:
    if len(text.strip()) < min_length:
        return False
    # Detect garbage OCR output (high proportion of non-alpha chars)
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.4:
        return False
    return True
```
Bottom line: Invest in pre-processing early. A small improvement here has multiplicative effects on retrieval quality across all downstream stages.