Building Effective Agents
PATTERN · COST STEADY

Evaluator-Optimiser

One model produces; a second critiques and asks for revision. The steadiest cost profile of the five. The one that loops if you let it.

By Oliver Wakefield-Smith, Digital Signet
Last verified April 2026

When to use it

Tasks with clear quality criteria where a critique loop genuinely improves output. Code review (the critic checks for missed edge cases). Content QA (the critic checks for factual claims). Structured-output validation (the critic checks the schema and the data quality, not just the schema).

We use this pattern in the QA stage of our build pipeline. The optimiser produces the page; the evaluator critiques it against a list of failure-pyramid signals. If the critic is satisfied, the page ships. If not, the optimiser revises.

When not to use it

Tasks where the evaluator and the optimiser run on the same model with the same prompt context. You are paying twice for the same opinion. The pattern works only when the critique introduces a new perspective: a different model, a different prompt frame, or an external check (a tool, a verifier, a regex).

Tasks with no clear quality criteria. If the critic does not have a yardstick, the pattern degenerates into vibes-as-judgement, and from there into the Loop.

Production cost data

Steadiest cost profile of the five patterns in our pipeline. Cost-per-task variance is the lowest, because the iteration count is bounded. At our default cap of 5 iterations, P50 iterations is 2, P95 is 4, P99 hits the cap.

The Loop incident: on one site, the evaluator decided the optimiser's output was always not-quite-good-enough and kept asking for revisions. The agent ran 14 iterations before hitting the budget cap. The output of iteration 14 was no better than iteration 3. After we shipped the marginal-improvement detector and the iteration cap, the median iteration count dropped from 3 to 2 and the P99 stopped hitting the cap.

The full Note is at /operator-notes/the-day-our-evaluator-optimised-itself-into-a-loop/.

Anti-patterns

  • Same model, same prompt. Evaluator and optimiser must differ on at least one axis: model, prompt frame, or grounding. Otherwise you are doing reflection, expensively.
  • No marginal-improvement detector. The iteration cap is a safety net. Without the detector you will hit the cap on the long tail and pay for it.
  • Critic-as-vibes. If the critic does not output a structured verdict (acceptable / not acceptable, with reason), the pattern is unmonitored. We require the critic to return a JSON verdict.
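A minimal sketch of what "a structured verdict" can mean in practice. The `Verdict` dataclass and `parse_verdict` helper are illustrative names, not part of any pipeline described above; the point is that the critic's raw JSON is parsed and validated, so a vibes-only critique fails loudly instead of silently steering the loop.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    acceptable: bool
    reason: str


def parse_verdict(raw: str) -> Verdict:
    # Reject anything that is not a structured verdict: a missing field
    # raises KeyError, non-JSON raises json.JSONDecodeError. Either way,
    # an unstructured critique cannot slip into the loop unmonitored.
    data = json.loads(raw)
    return Verdict(acceptable=bool(data["acceptable"]), reason=str(data["reason"]))
```

The verdict then drives the loop's exit condition, and the `reason` field becomes the feedback passed back to the optimiser.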

Sample code

# Evaluator-Optimiser with iteration cap and marginal-improvement detector.
def critique_loop(task, max_iters=5, marginal=20):
    output = optimizer.produce(task)
    for i in range(max_iters):
        verdict = evaluator.review(task, output)
        if verdict.acceptable:
            return output
        new_output = optimizer.revise(task, output, verdict.feedback)
        # Marginal-improvement detector: if a revision changed fewer than
        # `marginal` tokens of edit distance, further iterations are unlikely
        # to help. Accept the revision and stop.
        if edit_distance(output, new_output) < marginal:
            return new_output
        output = new_output
    return output  # Cap reached
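The sample above calls an `edit_distance` helper it does not define. One possible sketch, using `difflib` from the standard library, measures a token-level distance: the count of tokens in either output that do not fall inside a common matching block. This is an assumption about the metric, not the pipeline's actual implementation; any token- or character-level distance with a stable unit works, as long as the `marginal` threshold is expressed in the same unit.

```python
import difflib


def edit_distance(old: str, new: str) -> int:
    # Token-level distance: tokens on either side that are not part of a
    # matching block between the two outputs. Whitespace tokenisation is
    # crude but cheap, and cheap is what a per-iteration guard needs.
    old_toks, new_toks = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_toks, b=new_toks, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (len(old_toks) - matched) + (len(new_toks) - matched)
```

Identical outputs score 0; a revision that swaps one token scores 2 (one token removed, one added), so a `marginal` of 20 tolerates roughly ten swapped tokens before the detector fires.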

Cross-pattern interactions

Evaluator-optimiser is the natural QA layer for an orchestrator-worker pattern: the orchestrator runs the work, the evaluator-optimiser checks the synthesis. It also pairs with routing when the routing decision needs validation; the Confidence Gate is a degenerate evaluator-optimiser of one iteration.
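The "degenerate evaluator-optimiser of one iteration" can be sketched directly. The function below is hypothetical; `review` and `revise` are stand-ins for the evaluator and optimiser calls, passed in as callables so the shape of the gate is visible: one review, at most one revision, no loop.

```python
def confidence_gate(task, route, review, revise, threshold=0.8):
    # A Confidence Gate as a single-iteration evaluator-optimiser:
    # accept the routing decision if the reviewer is confident enough,
    # otherwise revise once and return whatever comes back.
    confidence, feedback = review(task, route)
    if confidence >= threshold:
        return route
    return revise(task, route, feedback)
```

Because there is no iteration, neither the cap nor the marginal-improvement detector is needed here; both exist only to bound a loop.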

Engineering FAQ

What is the difference between evaluator-optimiser and reflection?

Reflection is a single self-evaluation step the same model runs on its own output. Evaluator-optimiser uses two separate calls (or two separate models): one produces, one critiques. The pattern works because the critique is conditioned on a different prompt context, which gives a meaningful second opinion.

How do I prevent the evaluator-optimiser from looping?

Two layers. A hard cap on iterations (we use 5). A marginal-improvement detector: if iterations N and N+1 differ by less than X tokens of edit distance, accept iteration N+1 and stop. The detector is more important than the cap; the cap is the safety net.

When is the evaluator just expensive reflection?

When evaluator and optimiser run on the same model with the same prompt context. You are paying twice for the same opinion. Either change the model on one side, change the prompt context, or do not run the pattern.

Read next

The Failure Pyramid

The Loop is the named failure mode for this pattern.

The Confidence Gate

An evaluator-optimiser of one iteration, applied to routing.

Oliver Wakefield-Smith, Founder of Digital Signet
ABOUT THE AUTHOR
Oliver Wakefield-Smith
Founder, Digital Signet

Oliver runs Digital Signet, a research and product studio that operates ~500 production sites with AI agents as the engineering layer. The Digital Signet portfolio is built using a continuous AI-agent build pipeline, one of the largest agent-operated publishing operations on the open web. The handbook draws directly from those deployments: real cost data, real failure modes, real recovery patterns.