The Failure Pyramid
A five-level taxonomy of production AI agent failure modes, ranked by frequency. Drawn from observation across ~500 sites in our pipeline.
Production agents fail in roughly five categories, ranked here by how often we see them in our pipeline. The most common failure is not the one engineers expect: it is silent drift, where the agent completes the task, the task is wrong, and no exception is raised.
Silent Drift
VERY COMMON
The agent completes the task. The task is wrong. No exception is raised. The hardest failure mode to catch and the most common one in our pipeline.
A content-update agent revised a pricing table and the new figures were internally consistent but referenced last quarter's source. The build passed. The page shipped. We caught it nine days later from a reader email. We catch about 80% of these now. Two years ago we caught maybe 40%.
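Silent drift leaves nothing to catch at run time, so the only real defence is a check that compares shipped output against an independent expectation. A minimal sketch of that idea, assuming the agent attaches a source-period label to every figure it writes; SourcedFigure, the quarter format, and the check itself are illustrative stand-ins, not our actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SourcedFigure:
    """A figure the agent wrote, plus the source period it cited."""
    value: float
    source_period: str  # e.g. "2024-Q3", taken from the citation the agent attached


def expected_period(today: date) -> str:
    """Quarter label the page is supposed to cite, e.g. '2024-Q4'."""
    quarter = (today.month - 1) // 3 + 1
    return f"{today.year}-Q{quarter}"


def check_source_freshness(figures: list[SourcedFigure], today: date) -> list[str]:
    """Flag any figure whose cited period is not the current one.

    Silent drift raises no exception, so the check compares the shipped
    output against an external expectation rather than inspecting the run.
    """
    want = expected_period(today)
    return [
        f"figure {f.value} cites {f.source_period}, expected {want}"
        for f in figures
        if f.source_period != want
    ]


if __name__ == "__main__":
    table = [
        SourcedFigure(value=49.0, source_period="2024-Q4"),
        SourcedFigure(value=99.0, source_period="2024-Q3"),  # stale citation: drift
    ]
    for problem in check_source_freshness(table, today=date(2024, 11, 12)):
        print("DRIFT:", problem)
```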
Cost Cliff
COMMON
The orchestrator-worker pattern decides to spawn N workers for a task that should have used three. Cost spikes by an order of magnitude. The task may or may not succeed; the bill is the part that hurts.
Last month our orchestrator spawned 47 worker calls for a task that should have used three. The bill for that single run was $4.20. Across our pipeline that is a five-figure annual leak if we do not catch it. We catch this with a post-run cost-per-task alert and a hard cap on max-workers-per-plan.
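A minimal sketch of those two guards, assuming a plan object that lists its intended worker calls before any of them run; Plan, PlanRejected, and the threshold values are illustrative stand-ins for whatever the orchestrator actually exposes.

```python
from dataclasses import dataclass, field

MAX_WORKERS_PER_PLAN = 8          # hard cap: reject the plan, don't trim it silently
COST_ALERT_THRESHOLD_USD = 1.00   # post-run alert threshold per task


@dataclass
class Plan:
    task_id: str
    worker_prompts: list[str] = field(default_factory=list)


class PlanRejected(Exception):
    pass


def enforce_worker_cap(plan: Plan) -> Plan:
    """Fail loudly before any worker is spawned, so the bill never happens."""
    if len(plan.worker_prompts) > MAX_WORKERS_PER_PLAN:
        raise PlanRejected(
            f"{plan.task_id}: plan wants {len(plan.worker_prompts)} workers, "
            f"cap is {MAX_WORKERS_PER_PLAN}"
        )
    return plan


def post_run_cost_check(task_id: str, cost_usd: float) -> None:
    """The backstop: flag any single run whose cost crosses the threshold."""
    if cost_usd > COST_ALERT_THRESHOLD_USD:
        print(f"COST ALERT: {task_id} cost ${cost_usd:.2f} "
              f"(threshold ${COST_ALERT_THRESHOLD_USD:.2f})")


if __name__ == "__main__":
    runaway = Plan("pricing-refresh", worker_prompts=["summarise one source"] * 47)
    try:
        enforce_worker_cap(runaway)
    except PlanRejected as exc:
        print("REJECTED:", exc)
    post_run_cost_check("pricing-refresh", cost_usd=4.20)
```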
Confidence-Gate Breach
REGULAR
The routing pattern fails to recognise a boundary case and confidently routes to the wrong sub-agent. The downstream sub-agent does the wrong work without knowing it.
A routing classifier that worked across 95% of inputs failed on a class of edge cases where its confidence no longer tracked its accuracy: it confidently routed inputs from a class it had never been trained on. The Confidence Gate is the pattern that fixes this.
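A minimal sketch of a Confidence Gate, assuming the router exposes a per-label score distribution; the thresholds, labels, and fallback route are illustrative, not the pattern's canonical form.

```python
FALLBACK = "human_review"
MIN_CONFIDENCE = 0.85   # below this, don't trust the top label at all
MIN_MARGIN = 0.20       # top label must beat the runner-up by at least this much


def gate_route(scores: dict[str, float]) -> str:
    """Route only when the classifier is both confident and unambiguous.

    The breach above happened on inputs where confidence and accuracy
    diverged; requiring a clear margin over the runner-up catches the
    boundary cases where the decision is effectively a coin flip.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_score), (_, runner_up) = ranked[0], ranked[1]
    if top_score < MIN_CONFIDENCE or (top_score - runner_up) < MIN_MARGIN:
        return FALLBACK
    return top_label


if __name__ == "__main__":
    # Clear case: routed normally.
    print(gate_route({"pricing_update": 0.93, "content_rewrite": 0.05, "other": 0.02}))
    # Boundary case: barely ahead of the runner-up, so it goes to review.
    print(gate_route({"pricing_update": 0.51, "content_rewrite": 0.47, "other": 0.02}))
```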
Tool-Call Cascade
OCCASIONAL
A tool returns an unexpected schema (new field, missing field, type change). The agent does not handle the deviation and downstream tool calls compound the error into something difficult to diagnose.
A scraping tool started returning a new optional field. The agent's parser had been written before the field existed and silently ignored it on every call. Downstream summarisation included stale data for two weeks. We caught it from the data, not from the exception log.
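A minimal sketch of a schema guard sitting between the tool and the agent, assuming the tool returns plain JSON-style dicts; the expected schema and exception type are illustrative.

```python
EXPECTED_SCHEMA = {   # field name -> expected type
    "url": str,
    "title": str,
    "price": float,
}


class ToolSchemaDeviation(Exception):
    pass


def validate_tool_output(payload: dict) -> dict:
    """Fail on missing fields, type changes, and fields we have never seen.

    The cascade above began with an 'optional' field no code path was
    reading; treating any deviation as an error surfaces the change on
    the first call instead of two weeks later in the data.
    """
    problems = []
    for name, expected_type in EXPECTED_SCHEMA.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            problems.append(f"type change: {name} is {type(payload[name]).__name__}")
    for name in payload:
        if name not in EXPECTED_SCHEMA:
            problems.append(f"unexpected field: {name}")
    if problems:
        raise ToolSchemaDeviation("; ".join(problems))
    return payload


if __name__ == "__main__":
    try:
        validate_tool_output({
            "url": "https://example.com/widget",
            "title": "Widget",
            "price": 19.99,
            "discount_price": 14.99,   # the new field that started the cascade
        })
    except ToolSchemaDeviation as exc:
        print("BLOCKED:", exc)
```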
Context-Window Blow-Out
RARE
A long-running agent exhausts its context window. The model hallucinates from the truncation point. The output is plausible but wrong, and the wrongness is invisible without comparison against a ground truth.
An agent doing iterated research on a 12-source corpus exhausted context partway through source 9. Sources 10-12 were summarised against a hallucinated continuation of source 9. The summaries were superficially fine. They were wrong. This is the rarest failure mode in our pipeline because we cap iteration depth, but it is the hardest to diagnose when it does happen.
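A minimal sketch of the kind of cap that keeps this failure rare, assuming a rough token estimate per source; count_tokens is a crude stand-in for the model provider's tokenizer, and the budget numbers are illustrative.

```python
CONTEXT_BUDGET_TOKENS = 100_000   # what the model can actually hold
SAFETY_MARGIN = 0.8               # stop well before the hard limit


class ContextBudgetExceeded(Exception):
    pass


def count_tokens(text: str) -> int:
    """Crude stand-in: roughly four characters per token."""
    return len(text) // 4


def add_sources_within_budget(sources: list[str], prompt: str) -> list[str]:
    """Add whole sources until the next one would blow the budget, then stop loudly.

    Stopping with partial coverage is recoverable; truncating mid-source and
    letting the model continue from a hallucinated tail is not.
    """
    limit = int(CONTEXT_BUDGET_TOKENS * SAFETY_MARGIN)
    used = count_tokens(prompt)
    included: list[str] = []
    for i, source in enumerate(sources, start=1):
        cost = count_tokens(source)
        if used + cost > limit:
            raise ContextBudgetExceeded(
                f"stopping before source {i}: {used + cost} tokens would exceed "
                f"the {limit}-token budget; {len(included)} of {len(sources)} sources included"
            )
        included.append(source)
        used += cost
    return included


if __name__ == "__main__":
    corpus = ["source text " * 20_000 for _ in range(12)]   # twelve oversized sources
    try:
        add_sources_within_budget(corpus, prompt="Summarise the corpus.")
    except ContextBudgetExceeded as exc:
        print("STOPPED:", exc)
```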

Oliver runs Digital Signet, a research and product studio that operates ~500 production sites with AI agents as the engineering layer. The Digital Signet portfolio is built on a continuous AI-agent build pipeline and is one of the largest agent-operated publishing operations on the open web. The handbook draws directly from those deployments: real cost data, real failure modes, real recovery patterns.