$building.effective.agents
Last verified: April 2026 · Pattern P05

Evaluator-optimizer.

A generator LLM proposes a candidate. An evaluator LLM critiques it. If the evaluator rejects, its feedback informs the next generator pass. The loop continues until acceptance or a budget cap.

Definition

“In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop. This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value.”

From Anthropic, “Building Effective Agents”, December 2024. Self-refinement on language model outputs is described independently in Madaan et al., “Self-Refine” (2023).

What it does

Two roles, in a loop. The generator emits a candidate. The evaluator applies a rubric and decides accept or reject. On reject, the evaluator returns feedback that the generator uses to produce the next candidate. The loop terminates on accept or when a hard cap on iteration count is reached.

Figure: input → generator (proposes draft) → evaluator (accept / reject). On accept, the loop ends. On reject, feedback returns to the generator. Iterations are capped to avoid the infinite-rewrite loop.
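The loop above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Verdict`, `evaluator_optimizer`, and the toy `toy_generate` / `toy_evaluate` functions are hypothetical names, and the two callables stand in for whatever LLM calls or checkers a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Evaluator output: accept/reject plus feedback for the next pass."""
    accepted: bool
    feedback: str = ""


def evaluator_optimizer(
    task: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    evaluate: Callable[[str, str], Verdict],
    max_iterations: int = 3,
) -> tuple[str, bool, int]:
    """Loop until the evaluator accepts or the iteration cap is reached.

    Returns (last candidate, whether it was accepted, iterations used).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    candidate: Optional[str] = None
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        # The generator sees the previous candidate and the evaluator's feedback.
        candidate = generate(task, candidate, feedback)
        verdict = evaluate(task, candidate)
        if verdict.accepted:
            return candidate, True, iteration
        feedback = verdict.feedback
    # Hard cap reached: return the last candidate, flagged as not accepted.
    return candidate, False, max_iterations


# Toy stand-ins (assumptions for illustration; real roles would call a model).
def toy_generate(task: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None:
        return "hello world"
    return previous + "." if feedback else previous


def toy_evaluate(task: str, candidate: str) -> Verdict:
    if candidate.endswith("."):
        return Verdict(accepted=True)
    return Verdict(accepted=False, feedback="End the sentence with a period.")


result = evaluator_optimizer("greet", toy_generate, toy_evaluate)
```

With the toy roles, the first draft is rejected and the second is accepted, so `result` carries the revised text, an accepted flag, and an iteration count of 2. Returning the accepted flag separately lets the caller decide what to do with a candidate that only survived because the cap was hit.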

The generator and evaluator can be the same model with different prompts, two different models, or a model paired with a deterministic checker (a unit test, a schema validator, a compiler). The pattern subsumes “LLM as judge” as the evaluator role.
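As one illustration of the "model paired with a deterministic checker" variant, the evaluator below is a plain function that validates structured output and returns feedback the generator could act on. It is a sketch under assumptions: `REQUIRED_FIELDS` is a hypothetical schema, and `Verdict` and `schema_evaluator` are illustrative names rather than part of any library.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    accepted: bool
    feedback: str = ""


# Hypothetical schema: field name -> required Python type after JSON decoding.
REQUIRED_FIELDS = {"title": str, "priority": int}


def schema_evaluator(candidate: str) -> Verdict:
    """Deterministic evaluator: accept only JSON objects matching REQUIRED_FIELDS."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return Verdict(False, f"Output is not valid JSON: {exc.msg}")
    if not isinstance(data, dict):
        return Verdict(False, "Top-level value must be a JSON object.")
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field '{field}'")
        elif type(data[field]) is not expected:
            # Exact type match, so a JSON boolean does not pass as an int.
            problems.append(f"field '{field}' must be {expected.__name__}")
    if problems:
        return Verdict(False, "; ".join(problems))
    return Verdict(True)
```

Because the checker is deterministic, its verdict is repeatable, and its feedback string names the exact defects for the next generator pass.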

When it is appropriate

  • The acceptance criteria are clear enough that an evaluator can produce a reliable verdict. Examples: code with a test suite, prose with a checklist, structured output with a schema.
  • The first draft is plausibly imperfect and feedback can plausibly improve it. If the generator has no path to improvement, iteration is wasted.
  • The latency budget allows for iteration. Each iteration costs at least one generator call and one evaluator call.

Public examples

  • The Anthropic cookbook hosts the canonical reference implementation.
  • Self-Refine (Madaan et al., 2023) reports gains across mathematical reasoning, code generation, and dialogue tasks when iterative self-feedback is applied.
  • SWE-Bench evaluation harnesses use the equivalent loop: generate patch, run tests, on failure feed test output back as evaluator feedback.
  • LangGraph's reflection tutorial walks through a generator-and-reflector graph.

Cost considerations

Cost is dominated by iteration count. With k average iterations, total cost is approximately k * (generator + evaluator), where generator and evaluator are the per-call costs of each role. The variance in k is large: easy inputs converge in one iteration, while difficult inputs hit the cap. Vendor pricing pages (Anthropic, OpenAI) make per-iteration cost trivial to compute; the difficulty is forecasting k.
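The approximation can be worked through with a small calculation. The sketch below assumes a forecast distribution over iteration counts and flat per-call costs. The function name, the probabilities, and the dollar figures are all made up for illustration, and the estimate ignores that later iterations may carry longer prompts once feedback is included.

```python
def expected_loop_cost(iteration_probs, generator_cost, evaluator_cost):
    """Estimate average iterations k and cost k * (generator + evaluator).

    iteration_probs[i] is the assumed probability that the loop ends
    after i + 1 iterations; the last entry covers runs that hit the cap.
    """
    if abs(sum(iteration_probs) - 1.0) > 1e-9:
        raise ValueError("iteration_probs must sum to 1")
    expected_k = sum((i + 1) * p for i, p in enumerate(iteration_probs))
    return expected_k, expected_k * (generator_cost + evaluator_cost)


# Hypothetical forecast with a cap of 3: most inputs converge at once,
# a few need a second pass, and the rest run to the cap.
k, cost = expected_loop_cost([0.6, 0.1, 0.3], generator_cost=0.02, evaluator_cost=0.01)
```

Under these assumed numbers the average is about 1.7 iterations per input, yet nearly a third of inputs pay for all three, which is the variance the paragraph above describes.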

Two production patterns reduce variance. First, a marginal-improvement detector: if the candidates from iterations n and n-1 are sufficiently similar, the loop accepts even if the evaluator continues to flag minor issues. Second, a different model for the evaluator: a cheaper evaluator with a stricter rubric can outperform a generator acting as its own judge on many tasks (the model that produced the candidate is not always the best critic of it).
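One way to approximate the similarity check between iterations n and n-1 is a character-level ratio from Python's standard `difflib` module. This is only a sketch: `has_stalled` and the 0.95 threshold are assumptions, and a production system might compare embeddings or evaluator scores instead.

```python
from difflib import SequenceMatcher


def has_stalled(previous: str, current: str, threshold: float = 0.95) -> bool:
    """Return True when two consecutive candidates are nearly identical.

    Uses SequenceMatcher.ratio(), which is 1.0 for identical strings
    and approaches 0.0 as the strings share less content.
    """
    return SequenceMatcher(None, previous, current).ratio() >= threshold
```

Inside the loop, a call such as `has_stalled(previous_candidate, candidate)` after each generator pass would let the loop stop early instead of spending the remaining budget on rewrites that barely change the text.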

Failure mode

The named failure mode is the refinement loop: the evaluator rejects every candidate, each candidate is no better than the last, and the loop runs until the budget cap stops it. Mitigation requires both a hard iteration cap and a marginal-improvement detector. Without the cap, cost is unbounded; without the detector, every stalled input consumes the full budget.

Glossary

See evaluator-optimizer, LLM as judge, self-refine.

Foundational definitions on the sibling reference site: whatisanaiagent.com glossary.
