The Failure Pyramid
A five-level taxonomy of production AI agent failure modes, ranked by frequency. Drawn from observation across ~500 sites in our pipeline.
Production agents fail in roughly five categories, ranked here by how often we see them in our pipeline. The most common failure is not the one engineers expect: it is silent drift, where the agent completes the task, the task is wrong, and no exception is raised.
Silent Drift
VERY COMMON
The agent completes the task. The task is wrong. No exception is raised. The hardest failure mode to catch and the most common one in our pipeline.
A content-update agent revised a pricing table and the new figures were internally consistent but referenced last quarter's source. The build passed. The page shipped. We caught it nine days later from a reader email. We catch about 80% of these now. Two years ago we caught maybe 40%.
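Silent drift leaves nothing to catch at run time, so the only real defence is a check that compares shipped output against an independent expectation. A minimal sketch of that idea, assuming the agent attaches a source-period label to every figure it writes; SourcedFigure, the quarter format, and the check itself are illustrative stand-ins, not our actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SourcedFigure:
    """A figure the agent wrote, plus the source period it cited."""
    value: float
    source_period: str  # e.g. "2024-Q3", taken from the citation the agent attached


def expected_period(today: date) -> str:
    """Quarter label the page is supposed to cite, e.g. '2024-Q4'."""
    quarter = (today.month - 1) // 3 + 1
    return f"{today.year}-Q{quarter}"


def check_source_freshness(figures: list[SourcedFigure], today: date) -> list[str]:
    """Flag any figure whose cited period is not the current one.

    Silent drift raises no exception, so the check compares the shipped
    output against an external expectation rather than inspecting the run.
    """
    want = expected_period(today)
    return [
        f"figure {f.value} cites {f.source_period}, expected {want}"
        for f in figures
        if f.source_period != want
    ]


if __name__ == "__main__":
    table = [
        SourcedFigure(value=49.0, source_period="2024-Q4"),
        SourcedFigure(value=99.0, source_period="2024-Q3"),  # stale citation: drift
    ]
    for problem in check_source_freshness(table, today=date(2024, 11, 12)):
        print("DRIFT:", problem)
```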
Cost Cliff
COMMON
The orchestrator-worker pattern decides to spawn N workers for a task that should have used three. Cost spikes by an order of magnitude. The task may or may not succeed; the bill is the part that hurts.
Last month our orchestrator spawned 47 worker calls for a task that should have used three. The bill for that single run was $4.20. Across our pipeline that is a five-figure annual leak if we do not catch it. We catch this with a post-run cost-per-task alert and a hard cap on max-workers-per-plan.
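A minimal sketch of those two guards, assuming a plan object that lists its intended worker calls before any of them run; Plan, PlanRejected, and the threshold values are illustrative stand-ins for whatever the orchestrator actually exposes.

```python
from dataclasses import dataclass, field

MAX_WORKERS_PER_PLAN = 8          # hard cap: reject the plan, don't trim it silently
COST_ALERT_THRESHOLD_USD = 1.00   # post-run alert threshold per task


@dataclass
class Plan:
    task_id: str
    worker_prompts: list[str] = field(default_factory=list)


class PlanRejected(Exception):
    pass


def enforce_worker_cap(plan: Plan) -> Plan:
    """Fail loudly before any worker is spawned, so the bill never happens."""
    if len(plan.worker_prompts) > MAX_WORKERS_PER_PLAN:
        raise PlanRejected(
            f"{plan.task_id}: plan wants {len(plan.worker_prompts)} workers, "
            f"cap is {MAX_WORKERS_PER_PLAN}"
        )
    return plan


def post_run_cost_check(task_id: str, cost_usd: float) -> None:
    """The backstop: flag any single run whose cost crosses the threshold."""
    if cost_usd > COST_ALERT_THRESHOLD_USD:
        print(f"COST ALERT: {task_id} cost ${cost_usd:.2f} "
              f"(threshold ${COST_ALERT_THRESHOLD_USD:.2f})")


if __name__ == "__main__":
    runaway = Plan("pricing-refresh", worker_prompts=["summarise one source"] * 47)
    try:
        enforce_worker_cap(runaway)
    except PlanRejected as exc:
        print("REJECTED:", exc)
    post_run_cost_check("pricing-refresh", cost_usd=4.20)
```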
Confidence-Gate Breach
REGULAR
The routing pattern fails to recognise a boundary case and confidently routes to the wrong sub-agent. The downstream sub-agent does the wrong work without knowing it.
A routing classifier that worked across 95% of inputs failed on a class of edge cases where its confidence no longer tracked its accuracy: it confidently routed inputs from a class it had never been trained on. The Confidence Gate is the pattern that fixes this.
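A minimal sketch of a Confidence Gate, assuming the router exposes a per-label score distribution; the thresholds, labels, and fallback route are illustrative, not the pattern's canonical form.

```python
FALLBACK = "human_review"
MIN_CONFIDENCE = 0.85   # below this, don't trust the top label at all
MIN_MARGIN = 0.20       # top label must beat the runner-up by at least this much


def gate_route(scores: dict[str, float]) -> str:
    """Route only when the classifier is both confident and unambiguous.

    The breach above happened on inputs where confidence and accuracy
    diverged; requiring a clear margin over the runner-up catches the
    boundary cases where the decision is effectively a coin flip.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_score), (_, runner_up) = ranked[0], ranked[1]
    if top_score < MIN_CONFIDENCE or (top_score - runner_up) < MIN_MARGIN:
        return FALLBACK
    return top_label


if __name__ == "__main__":
    # Clear case: routed normally.
    print(gate_route({"pricing_update": 0.93, "content_rewrite": 0.05, "other": 0.02}))
    # Boundary case: barely ahead of the runner-up, so it goes to review.
    print(gate_route({"pricing_update": 0.51, "content_rewrite": 0.47, "other": 0.02}))
```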
Tool-Call Cascade
OCCASIONAL
A tool returns an unexpected schema (new field, missing field, type change). The agent does not handle the deviation and downstream tool calls compound the error into something difficult to diagnose.
A scraping tool started returning a new optional field. The agent's parser had been written before the field existed and silently ignored it on every call. Downstream summarisation included stale data for two weeks. We caught it from the data, not from the exception log.
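A minimal sketch of a schema guard sitting between the tool and the agent, assuming the tool returns plain JSON-style dicts; the expected schema and exception type are illustrative.

```python
EXPECTED_SCHEMA = {   # field name -> expected type
    "url": str,
    "title": str,
    "price": float,
}


class ToolSchemaDeviation(Exception):
    pass


def validate_tool_output(payload: dict) -> dict:
    """Fail on missing fields, type changes, and fields we have never seen.

    The cascade above began with an 'optional' field no code path was
    reading; treating any deviation as an error surfaces the change on
    the first call instead of two weeks later in the data.
    """
    problems = []
    for name, expected_type in EXPECTED_SCHEMA.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            problems.append(f"type change: {name} is {type(payload[name]).__name__}")
    for name in payload:
        if name not in EXPECTED_SCHEMA:
            problems.append(f"unexpected field: {name}")
    if problems:
        raise ToolSchemaDeviation("; ".join(problems))
    return payload


if __name__ == "__main__":
    try:
        validate_tool_output({
            "url": "https://example.com/widget",
            "title": "Widget",
            "price": 19.99,
            "discount_price": 14.99,   # the new field that started the cascade
        })
    except ToolSchemaDeviation as exc:
        print("BLOCKED:", exc)
```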
Context-Window Blow-Out
RARE
A long-running agent exhausts its context window. The model hallucinates from the truncation point. The output is plausible but wrong, and the wrongness is invisible without comparison against a ground truth.
An agent doing iterated research on a 12-source corpus exhausted context partway through source 9. Sources 10-12 were summarised against a hallucinated continuation of source 9. The summaries were superficially fine. They were wrong. This is the rarest failure mode in our pipeline because we cap iteration depth, but it is the hardest to diagnose when it does happen.
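A minimal sketch of the kind of cap that keeps this failure rare, assuming a rough token estimate per source; count_tokens is a crude stand-in for the model provider's tokenizer, and the budget numbers are illustrative.

```python
CONTEXT_BUDGET_TOKENS = 100_000   # what the model can actually hold
SAFETY_MARGIN = 0.8               # stop well before the hard limit


class ContextBudgetExceeded(Exception):
    pass


def count_tokens(text: str) -> int:
    """Crude stand-in: roughly four characters per token."""
    return len(text) // 4


def add_sources_within_budget(sources: list[str], prompt: str) -> list[str]:
    """Add whole sources until the next one would blow the budget, then stop loudly.

    Stopping with partial coverage is recoverable; truncating mid-source and
    letting the model continue from a hallucinated tail is not.
    """
    limit = int(CONTEXT_BUDGET_TOKENS * SAFETY_MARGIN)
    used = count_tokens(prompt)
    included: list[str] = []
    for i, source in enumerate(sources, start=1):
        cost = count_tokens(source)
        if used + cost > limit:
            raise ContextBudgetExceeded(
                f"stopping before source {i}: {used + cost} tokens would exceed "
                f"the {limit}-token budget; {len(included)} of {len(sources)} sources included"
            )
        included.append(source)
        used += cost
    return included


if __name__ == "__main__":
    corpus = ["source text " * 20_000 for _ in range(12)]   # twelve oversized sources
    try:
        add_sources_within_budget(corpus, prompt="Summarise the corpus.")
    except ContextBudgetExceeded as exc:
        print("STOPPED:", exc)
```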

Oliver runs Digital Signet, a research and product studio that operates ~500 production sites with AI agents as the engineering layer. The Digital Signet portfolio is built on a continuous AI-agent build pipeline and is one of the largest agent-operated publishing operations on the open web. The handbook draws directly from those deployments: real cost data, real failure modes, real recovery patterns.