Routing
1. What is Routing?
Routing classifies the input first, then sends it to the best-suited sub-pipeline for that category. Without routing, a single catch-all prompt is forced to handle every input type, which degrades performance for each.
┌──────────────────────────┐
┌───▶│ Handler A (e.g. refunds) │
│ └──────────────────────────┘
Input ──▶ Classifier─┤
│ ┌──────────────────────────┐
├───▶│ Handler B (tech support) │
│ └──────────────────────────┘
│
│ ┌──────────────────────────┐
└───▶│ Handler C (general FAQ) │
└──────────────────────────┘
The classifier can be: - An LLM call (flexible, handles nuance) - A traditional classifier (faster, cheaper for well-defined categories) - A keyword/rule system (deterministic, ultra-fast for simple cases)
2. When to Use Routing
Good fit when: - There are distinct input categories that are genuinely better handled differently - Optimising one category’s prompt hurts another category’s performance - You want to route easy queries to cheaper/faster models and hard ones to more capable models - Input classification can be done accurately (LLM or traditional classifier)
Examples: - Customer service: General questions → FAQ bot; Refunds → billing pipeline; Technical issues → escalation pipeline - Model cost optimisation: Simple/common questions → Claude Haiku; Complex/rare questions → Claude Sonnet - Document processing: Invoices → invoice extractor; Contracts → contract analyser; Emails → email responder - Code assistance: Bug reports → debugging prompt; Feature requests → design prompt; Questions → explanation prompt
3. Implementation Pattern
from openai import OpenAI
from enum import Enum
client = OpenAI()
class QueryType(str, Enum):
REFUND = "refund"
TECHNICAL = "technical"
GENERAL = "general"
def classify_query(query: str) -> QueryType:
"""Use a fast model to classify the incoming query."""
response = client.chat.completions.create(
model="gpt-4o-mini", # cheap + fast for classification
messages=[
{
"role": "system",
"content": (
"Classify the customer query into exactly one category: "
"'refund', 'technical', or 'general'. Respond with only the category name."
),
},
{"role": "user", "content": query},
],
)
raw = response.choices[0].message.content.strip().lower()
return QueryType(raw)
SYSTEM_PROMPTS = {
QueryType.REFUND: (
"You are a billing specialist. Help the customer with their refund request. "
"Be empathetic, explain the refund policy clearly, and collect any required information."
),
QueryType.TECHNICAL: (
"You are a senior technical support engineer. Diagnose the issue step by step. "
"Ask clarifying questions and provide actionable solutions."
),
QueryType.GENERAL: (
"You are a helpful customer service agent. Answer general questions concisely "
"and direct the customer to the right resource if needed."
),
}
def route_and_respond(query: str) -> str:
"""Classify the query and route to the appropriate specialised handler."""
query_type = classify_query(query)
system_prompt = SYSTEM_PROMPTS[query_type]
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": query},
],
)
return response.choices[0].message.content
# Usage
answer = route_and_respond("I was charged twice for my subscription last month.")4. Cost-Optimised Routing Pattern
Route by query complexity to balance cost vs quality:
def route_by_complexity(query: str) -> str:
"""Use a cheap model for easy queries, expensive model for hard ones."""
complexity = classify_complexity(query) # 'simple' or 'complex'
model = "claude-haiku-4-5" if complexity == "simple" else "claude-sonnet-4-5"
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": query}],
)
return response.choices[0].message.contentThis pattern can reduce inference costs by 60-80% while maintaining quality on complex queries.
5. Best Practices
- Keep categories mutually exclusive: Ambiguous boundaries cause misclassification. Define categories carefully.
- Add a catch-all fallback: Always handle
unknownor low-confidence classifications gracefully. - Validate classifier accuracy first: Before building all downstream handlers, confirm the classifier achieves >90% accuracy on a representative sample.
- Log classification decisions: Debugging misroutes is much easier with a trace of what category was assigned.
- Prefer traditional classifiers for high-throughput paths: A fine-tuned BERT classifier is 100× cheaper and faster than an LLM for classification — use LLM classification only when categories are nuanced or frequently changing.
- Use confidence thresholds: If classifier confidence is low, route to a human or a more capable model.