When to use it
Two cases. The first: you need throughput on a task where each input is independent. Run N at a time, take all the answers. This is the obvious case and the one that most teams reach for first.
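In code, the first case is just a gather over independent inputs. A minimal sketch, assuming a hypothetical async model.process call standing in for whatever your pipeline actually invokes:

# Throughput fan-out: independent inputs, keep every answer.
# model.process is an assumed async helper, not a real API.
import asyncio

async def fan_out(inputs):
    return await asyncio.gather(*(model.process(i) for i in inputs))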
The second: you need vote-based reliability on a high-stakes classification. Route the same input to N models, take the majority vote, escalate ties. We use this in the build pipeline for "does this page contain a regression" classification, where a false negative is expensive.
When not to use it
Tasks where the bottleneck is a single downstream tool. N parallel model calls still queue behind a tool that serialises requests: you have made the system slower, not faster, and you are paying for N model calls plus orchestration overhead.
Generative tasks (summarisation, writing). Votes do not converge cleanly across generative outputs. Even if you take the "best" one by some scoring metric, you have likely run an evaluator-optimiser pattern in disguise. Be explicit about which pattern you are running.
Production cost data
Concurrency ceiling in our pipeline lands at 8. Past 8, the throughput we measure starts to fall, not rise: orchestration and synchronisation overhead exceed the parallelisation gain. The exact number is pipeline-dependent (your tools, your network, your model providers) but the shape is consistent: there is always a ceiling; find it by experiment.
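The experiment is easier when the fan-out width is a single tunable number. A sketch of a bounded fan-out, again assuming the hypothetical async model.process; raise the limit only while measured throughput still rises with it:

# Bound the fan-out at a measured ceiling with a semaphore.
import asyncio

async def bounded_fan_out(inputs, limit=8):
    sem = asyncio.Semaphore(limit)
    async def one(i):
        async with sem:
            return await model.process(i)
    return await asyncio.gather(*(one(i) for i in inputs))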
Vote-based parallelisation, in our experience, drops error rate by roughly half versus single-call for classification tasks at N=5. Beyond N=5 the error rate continues to drop but the marginal benefit per call is small. Five is our default.
Anti-patterns
- Parallelising tool-bound work. If the slowest call in the chain is a serialised tool, the parallel models wait in queue. You have spent more, not gained more.
- Voting on generative output. Vote on classifications, structured outputs, or yes/no decisions. Votes on prose are vibes.
- No tie-break. Majority vote needs a defined behaviour for ties. Pick one before you ship: escalate to a stronger model, fall through to a different pattern, or default to the most conservative answer.
Sample code
# Parallel fan-out with vote aggregation.
# model.classify (async) and majority_vote are pipeline helpers.
import asyncio

async def parallel_classify(input_text, n=5):
    tasks = [model.classify(input_text) for _ in range(n)]
    results = await asyncio.gather(*tasks)
    return majority_vote(results)
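majority_vote is where the tie-break from the anti-patterns list has to live. A sketch, with escalation left as a hook you would wire to a stronger model or a conservative default:

# Majority vote with an explicit tie-break, not an arbitrary pick.
from collections import Counter

def majority_vote(results, escalate=None):
    counts = Counter(results).most_common()
    # A tie triggers the pre-chosen behaviour; refusing to ship
    # without one is the point.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        if escalate is None:
            raise ValueError("tie: define a tie-break before you ship")
        return escalate(results)
    return counts[0][0]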
Cross-pattern interactions
Parallelisation is often the worker layer of an orchestrator-worker pattern: the orchestrator dispatches sub-tasks, the workers fan out for the high-stakes ones. It also feeds into evaluator-optimiser: vote on the optimiser output, take the consensus.
Engineering FAQ
Is parallelisation the same as orchestrator-worker?
No. Orchestrator-worker decomposes a task into different sub-tasks. Parallelisation runs the same task across many models. The shapes look similar at the dispatch layer but the planning step is different. In orchestrator-worker the planner does work; in parallelisation the work is the fan-out itself.
When does vote-based parallelisation pay off?
When the cost of a wrong answer is high and the cost of paying for N model calls is low. We use vote-based parallelisation on classification tasks where one wrong label cascades into a downstream Cost Cliff. We do not use it on summarisation; the votes do not converge cleanly.
What is the concurrency ceiling?
Around 8 in our pipeline, but the real ceiling is whatever your downstream tools can handle. Eight parallel model calls is fine; eight parallel calls into a tool that serialises requests turns into eight serial calls plus orchestration overhead. Profile the tool before you scale.
Oliver runs Digital Signet, a research and product studio that operates ~500 production sites with AI agents as the engineering layer. The Digital Signet portfolio is built using a continuous AI-agent build pipeline, one of the largest agent-operated publishing operations on the open web. The handbook draws directly from those deployments: real cost data, real failure modes, real recovery patterns.