Evaluator-Optimiser Pattern: Production Cost and the Loop Failure

When to use it

Tasks with clear quality criteria where a critique loop genuinely improves the output: code review (the critic checks for missed edge cases), content QA (the critic checks factual claims), and structured-output validation (the critic checks the data quality, not just the schema).

We use this pattern in the QA stage of our build pipeline. The optimiser produces the page; the evaluator critiques it against a list of failure-pyramid signals. If the critic is satisfied, the page ships. If not, the optimiser revises.

When not to use it

Tasks where the evaluator and the optimiser are running on the same model with the same prompt context. You are paying twice for the same opinion. The pattern works only when the critique introduces a new perspective, either through a different model, a different prompt frame, or an external check (a tool, a verifier, a regex).
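One way to get the "external check" axis is to make part of the evaluator deterministic. The sketch below is illustrative only: the field names (`title`, `price`), the price rule and the `external_check` function are hypothetical, not taken from our pipeline. It shows a critic that checks both the schema and the data quality without a model call.

```python
import json
import re

# Hypothetical rule for illustration; a real pipeline would supply its own.
PRICE_RE = re.compile(r"^\d+\.\d{2}$")


def external_check(raw_output: str) -> tuple[bool, str]:
    """Deterministic critic: checks the schema and the data quality.

    Returns (acceptable, feedback). No model call is involved, so the
    verdict cannot simply echo the optimiser's own opinion.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"output is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "output must be a JSON object"

    # Schema check: required keys are present.
    missing = [key for key in ("title", "price") if key not in data]
    if missing:
        return False, f"missing fields: {', '.join(missing)}"

    # Data-quality checks: the values must also make sense.
    if not isinstance(data["title"], str) or not data["title"].strip():
        return False, "title is empty"
    if not isinstance(data["price"], str) or not PRICE_RE.match(data["price"]):
        return False, "price must look like 12.50"
    return True, "ok"
```

The feedback string is what would be handed back to the optimiser on a rejection, so it names the failing field rather than returning a bare "invalid".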

Tasks with no clear quality criteria. If the critic does not have a yardstick, the pattern degenerates into vibes-as-judgement, and from there into the Loop.

Production cost data

This pattern has the steadiest cost profile of the five patterns in our pipeline. Cost-per-task variance is the lowest because the iteration count is bounded. At our default cap of 5 iterations, P50 is 2 iterations, P95 is 4, and P99 hits the cap.
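For readers who want to track the same percentiles, here is a minimal sketch using the nearest-rank method. The iteration counts are made up for illustration and are not our production data; the `percentile` helper is an assumption, not part of our pipeline.

```python
import math


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up iteration counts for ten tasks, for illustration only.
iteration_counts = [1, 2, 2, 2, 2, 3, 3, 4, 4, 5]
summary = {f"P{p}": percentile(iteration_counts, p) for p in (50, 95, 99)}
print(summary)
```

With a cap in place, a P99 equal to the cap means the long tail is being stopped by the safety net rather than by a verdict.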

The Loop incident: on one site, the evaluator kept judging the optimiser's output not quite good enough and kept asking for revisions. The agent ran 14 iterations before hitting the budget cap. The output of iteration 14 was no better than that of iteration 3. After we shipped the marginal-improvement detector and the iteration cap, the median iteration count dropped from 3 to 2 and the P99 stopped hitting the cap.

The full Note is at /operator-notes/the-day-our-evaluator-optimised-itself-into-a-loop/.

Anti-patterns

Same model, same prompt. Evaluator and optimiser must differ on at least one axis: model, prompt frame, or grounding. Otherwise you are doing reflection, expensively.
No marginal-improvement detector. The iteration cap is a safety net. Without the detector you will hit the cap on the long tail and pay for it.
Critic-as-vibes. If the critic does not output a structured verdict (acceptable / not acceptable, with reason), the pattern is unmonitored. We require the critic to return a JSON verdict.
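A structured verdict is only useful if it is validated before the loop acts on it. The sketch below assumes a reply shape of `{"acceptable": true, "reason": "..."}`; the field names, the `Verdict` class and `parse_verdict` are hypothetical. Treating a malformed reply as a rejection is one possible design choice, not the only one.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    acceptable: bool
    reason: str


def parse_verdict(raw: str) -> Verdict:
    """Parse the critic's reply; anything malformed counts as a rejection.

    Assumed reply shape (hypothetical): {"acceptable": true, "reason": "..."}
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict(False, "critic reply was not valid JSON")
    if not isinstance(data, dict):
        return Verdict(False, "critic reply was not a JSON object")
    acceptable = data.get("acceptable")
    reason = data.get("reason")
    # Require a real boolean: a string such as "yes" is not a verdict.
    if not isinstance(acceptable, bool):
        return Verdict(False, "critic reply had no boolean 'acceptable' field")
    if not isinstance(reason, str) or not reason.strip():
        return Verdict(False, "critic reply gave no reason")
    return Verdict(acceptable, reason.strip())
```

Because every path returns a `Verdict` with a reason, each iteration can be logged in the same shape whether the critic accepted, rejected, or replied with free text.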

Sample code

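# Evaluator-Optimiser with iteration cap and marginal-improvement detector.
# optimizer, evaluator and edit_distance are supplied by the surrounding pipeline.
def critique_loop(task, max_iters=5, marginal=20):
    """Revise until the critic accepts, revisions stall, or the cap is reached."""
    output = optimizer.produce(task)
    for _ in range(max_iters):
        verdict = evaluator.review(task, output)
        if verdict.acceptable:
            return output
        new_output = optimizer.revise(task, output, verdict.feedback)
        # Marginal-improvement detector: a near-identical revision means the
        # loop has stalled, so accept it rather than pay for another round.
        if edit_distance(output, new_output) < marginal:
            return new_output
        output = new_output
    return output  # Cap reached: the last revision ships without another review.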
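The sample leaves `optimizer`, `evaluator` and `edit_distance` undefined. The following is a self-contained sketch of how it could be exercised without model calls. Everything here is an assumption for illustration: the token-level Levenshtein distance is one plausible reading of "edit distance", the stubs stand in for real model calls, and `run_loop` is a hypothetical variant that also reports why the loop stopped.

```python
from typing import NamedTuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over whitespace-separated tokens."""
    xs, ys = a.split(), b.split()
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, start=1):
        curr = [i]
        for j, y in enumerate(ys, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


class Review(NamedTuple):
    acceptable: bool
    feedback: str


def run_loop(produce, revise, review, task, max_iters=5, marginal=20):
    """Same control flow as the sample, plus the reason the loop stopped."""
    output = produce(task)
    for _ in range(max_iters):
        verdict = review(task, output)
        if verdict.acceptable:
            return output, "accepted"
        new_output = revise(task, output, verdict.feedback)
        if edit_distance(output, new_output) < marginal:
            return new_output, "stalled"
        output = new_output
    return output, "cap"


# Stubs standing in for model calls: a critic that is never satisfied and
# an optimiser that only appends one word per revision.
def never_happy(task, output):
    return Review(False, "not quite good enough")


def tiny_edit(task, output, feedback):
    return output + " again"


result, reason = run_loop(lambda t: "draft", tiny_edit, never_happy, "page")
print(result, reason)
```

With a never-satisfied critic, the detector stops the loop after the first one-token revision. Passing `marginal=0` disables the detector, and the same stubs then run all the way to the cap.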

Cross-pattern interactions

Evaluator-optimiser is the natural QA layer for an orchestrator-worker pattern: the orchestrator runs the work, the evaluator-optimiser checks the synthesis. It also pairs with routing when the routing decision needs validation; the Confidence Gate is a degenerate evaluator-optimiser of one iteration.

Engineering FAQ

What is the difference between evaluator-optimiser and reflection?

Reflection is a single self-evaluation step the same model runs on its own output. Evaluator-optimiser uses two separate calls (or two separate models): one produces, one critiques. The pattern works because the critique is conditioned on a different prompt context, which gives a meaningful second opinion.

How do I prevent the evaluator-optimiser from looping?

Two layers. The first is a hard cap on iterations (we use 5). The second is a marginal-improvement detector: if iterations N and N+1 differ by fewer than X tokens of edit distance, accept iteration N+1 and stop. The detector is more important than the cap; the cap is the safety net.

When is the evaluator just expensive reflection?

When evaluator and optimiser run on the same model with the same prompt context. You are paying twice for the same opinion. Either change the model on one side, change the prompt context, or do not run the pattern.
