Parallelization
1. What is Parallelization?
Parallelization runs several LLM calls at the same time and combines their outputs programmatically. It has two distinct variants:
Sectioning — Split and conquer
Break a task into independent subtasks that run concurrently, then aggregate results.
         ┌──────────────▶ LLM Worker A ─────────────┐
         │                                          │
Input ───┼──────────────▶ LLM Worker B ─────────────┼──▶ Aggregator ──▶ Output
         │                                          │
         └──────────────▶ LLM Worker C ─────────────┘
Voting — Run multiple times, pick the best
Run the same task multiple times independently, then aggregate results via majority vote or synthesis.
         ┌──────────────▶ LLM Run 1 ─────────────┐
         │                                       │
Input ───┼──────────────▶ LLM Run 2 ─────────────┼──▶ Vote / Aggregate ──▶ Final Output
         │                                       │
         └──────────────▶ LLM Run 3 ─────────────┘
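The "majority vote" step can be sketched in plain Python, with no LLM calls involved. This is a minimal illustration, not a definitive implementation: the function name `majority_vote` and the `tie_breaker` parameter are assumptions introduced here, and falling back to the stricter label on a tie is one possible policy among several.

```python
from collections import Counter


def majority_vote(votes: list[str], tie_breaker: str = "unsafe") -> str:
    """Return the most common normalised vote, or tie_breaker if the top two tie."""
    if not votes:
        raise ValueError("need at least one vote")
    # Normalise so that "Unsafe " and "unsafe" count as the same label.
    counts = Counter(v.strip().lower() for v in votes)
    ranked = counts.most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_breaker  # hypothetical policy: prefer the cautious label
    return ranked[0][0]


print(majority_vote(["safe", "Unsafe ", "unsafe"]))  # prints "unsafe"
```

Making the tie policy explicit keeps the result deterministic even when the number of runs is even or the runs return more than two distinct labels.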
2. When to Use Parallelization
Sectioning is a good fit when:
- Subtasks are genuinely independent (one doesn’t affect another)
- Each aspect of a complex task benefits from dedicated, focused attention
- Latency matters and subtasks can be run concurrently
Voting is a good fit when:
- High accuracy is critical and multiple independent attempts improve confidence
- The task has stochastic variability you want to reduce
- You need to detect adversarial content or edge cases
Examples of sectioning:
- Run a guardrails check (LLM A) in parallel with generating the core response (LLM B)
- Evaluate multiple dimensions of an LLM response simultaneously (accuracy, tone, relevance, safety)
Examples of voting:
- Code vulnerability review: three different prompts each scan for different vulnerability classes; flag if any finds a problem
- Content moderation: multiple evaluators assess different risk dimensions; require a threshold of votes to flag
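The two voting examples use different aggregation rules: "flag if any finds a problem" versus "require a threshold of votes". A minimal sketch of both, assuming each evaluator's answer has already been reduced to a boolean (the function names here are hypothetical):

```python
def flag_if_any(findings: list[bool]) -> bool:
    """Flag when at least one reviewer reports a problem (high recall)."""
    return any(findings)


def flag_if_threshold(findings: list[bool], threshold: int) -> bool:
    """Flag only when at least `threshold` evaluators agree (fewer false positives)."""
    return sum(findings) >= threshold


# One of three reviewers found a problem.
findings = [False, True, False]
print(flag_if_any(findings))           # prints True
print(flag_if_threshold(findings, 2))  # prints False
```

The "any" rule suits vulnerability review, where a missed problem is costly; a threshold suits moderation, where a single noisy evaluator should not be enough to flag content.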
3. Implementation Pattern
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def llm_call_async(prompt: str, system: str = "") -> str:
    """Send one chat completion request and return the reply text."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    # content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""
# ── Sectioning Example ─────────────────────────────────────────────────────
async def review_code_sectioned(code: str) -> dict:
    """Evaluate code across 3 dimensions in parallel."""
    security_task = llm_call_async(
        f"Review this code for security vulnerabilities only:\n\n{code}",
        system="You are a security expert. Focus only on security issues."
    )
    performance_task = llm_call_async(
        f"Review this code for performance issues only:\n\n{code}",
        system="You are a performance engineer. Focus only on performance."
    )
    style_task = llm_call_async(
        f"Review this code for style and maintainability only:\n\n{code}",
        system="You are a senior engineer. Focus only on code quality."
    )
    security, performance, style = await asyncio.gather(
        security_task, performance_task, style_task
    )
    return {"security": security, "performance": performance, "style": style}
# ── Voting Example ──────────────────────────────────────────────────────────
async def classify_with_voting(text: str, n_votes: int = 3) -> str:
    """Run classification n times and return the majority vote."""
    tasks = [
        llm_call_async(
            f"Is this content safe or unsafe? Respond with only 'safe' or 'unsafe'.\n\n{text}"
        )
        for _ in range(n_votes)
    ]
    votes = await asyncio.gather(*tasks)
    normalised = [v.strip().lower() for v in votes]
    return max(set(normalised), key=normalised.count)  # majority vote
# Usage
results = asyncio.run(review_code_sectioned("def get_user(id): return db.query(f'SELECT * FROM users WHERE id={id}')"))
verdict = asyncio.run(classify_with_voting("Buy cheap meds online no prescription"))
4. Guardrail + Response in Parallel
A common production pattern: run the guardrail check concurrently with the main response generation. If the guardrail flags the input, discard the main response.
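async def safe_respond(user_input: str) -> str:
    guardrail_task = llm_call_async(
        "Does this input contain harmful, illegal, or inappropriate content? "
        f"Reply only YES or NO.\n\nInput: {user_input}"
    )
    response_task = llm_call_async(
        user_input,
        system="You are a helpful assistant."
    )
    # Both calls run concurrently; gather returns once both have finished.
    guardrail_result, response = await asyncio.gather(guardrail_task, response_task)
    if guardrail_result.strip().upper().startswith("YES"):
        return "I'm unable to help with that request."
    return response
Because the two calls run concurrently, total latency is roughly that of the slower call rather than the sum of both, so the guardrail adds little or no latency compared with running it before the response. The trade-off is that the response call is paid for even when its output is discarded.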
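The function above waits for both calls before it looks at the guardrail result. One possible variant, sketched here under assumptions, starts the response as a task and cancels it as soon as the guardrail flags the input. The names `guarded_respond`, `check`, and `respond` are hypothetical, and the two stub coroutines stand in for real LLM calls so the sketch runs without network access. Whether a provider still bills for a request cancelled on the client side is outside the scope of this sketch.

```python
import asyncio
from typing import Awaitable, Callable

REFUSAL = "I'm unable to help with that request."


async def guarded_respond(
    user_input: str,
    check: Callable[[str], Awaitable[bool]],
    respond: Callable[[str], Awaitable[str]],
) -> str:
    """Start both calls at once; cancel the response if the guardrail flags the input."""
    response_task = asyncio.create_task(respond(user_input))
    try:
        flagged = await check(user_input)
    except Exception:
        response_task.cancel()  # do not leave the response running if the check fails
        raise
    if flagged:
        response_task.cancel()
        # Wait for the cancellation to finish so no task is left pending.
        await asyncio.gather(response_task, return_exceptions=True)
        return REFUSAL
    return await response_task


# Stub coroutines (assumptions for illustration only).
async def fake_check(text: str) -> bool:
    await asyncio.sleep(0.01)
    return "attack" in text


async def fake_respond(text: str) -> str:
    await asyncio.sleep(0.05)
    return f"echo: {text}"


print(asyncio.run(guarded_respond("hello", fake_check, fake_respond)))
print(asyncio.run(guarded_respond("attack plan", fake_check, fake_respond)))
```

With this variant a flagged input returns as soon as the guardrail answers, instead of waiting for the response call to complete.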
5. Best Practices
- Use asyncio.gather for true concurrency: Parallel HTTP calls with async I/O, not threads.
- Set a concurrency limit: Use asyncio.Semaphore to avoid hitting API rate limits.
- Design subtasks to be truly independent: If Task B needs Task A’s output, it’s a chain, not parallelization.
- For voting, use odd numbers: 3 or 5 votes avoid ties in binary decisions.
- Aggregate thoughtfully: For complex outputs, a synthesis LLM call is better than naive concatenation.
- Cost consideration: N parallel calls cost N times as much. Verify the quality improvement justifies it.
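The concurrency-limit advice can be sketched as a small wrapper around asyncio.gather. This is a minimal illustration under assumptions: `gather_limited`, `call`, and `max_concurrent` are hypothetical names, and the stub in `demo` replaces a real LLM call so the example runs offline and can record how many calls were in flight at once.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def gather_limited(
    prompts: Sequence[str],
    call: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
) -> list[str]:
    """Run call(prompt) for every prompt, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(prompt: str) -> str:
        # Each call waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await call(prompt)

    # gather preserves input order in its results.
    return await asyncio.gather(*(bounded(p) for p in prompts))


async def demo() -> tuple[list[str], int]:
    in_flight = 0
    peak = 0

    async def fake_call(prompt: str) -> str:  # stub standing in for an LLM call
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return prompt.upper()

    results = await gather_limited(["a", "b", "c", "d"], fake_call, max_concurrent=2)
    return results, peak


print(asyncio.run(demo()))  # prints (['A', 'B', 'C', 'D'], 2)
```

A suitable value for `max_concurrent` depends on the rate limits of the account in use, so it is left as a parameter rather than a fixed constant.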