$building.effective.agents
Last verified: April 2026 · Evaluation

Evaluating an agent.

Agent evaluation is harder than language-model evaluation because agents are stateful, non-deterministic, and depend on tools whose behaviour can change between runs. Reliability matters more than peak capability. This page covers the main public benchmarks and how to read them.

Four dimensions of evaluation

For an LLM, the dominant metric is accuracy on a static benchmark. For an agent, accuracy is one of four dimensions (a measurement sketch follows the list):

  • Capability: does the agent succeed on a representative task at all? This is what most public benchmarks measure.
  • Reliability: does it succeed consistently across runs? Non-determinism in the model and in tool responses means a single run is not a measurement.
  • Cost: per-task tokens including retries. Public benchmarks rarely report this; vendor pricing pages provide the per-token rates.
  • Latency: end-to-end wall-clock time. A pattern like parallelization may add token cost while reducing latency.
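
Capability shows up in a single run; the other three only show up across repeated runs of the same task. A minimal sketch of collapsing repeated runs into these four numbers, assuming each run is recorded with a success flag, a token count, and a wall-clock time — RunResult and summarize are illustrative names, not taken from any framework:

  from dataclasses import dataclass
  from statistics import mean

  @dataclass
  class RunResult:
      # One execution of one task; an illustrative record, not a standard schema.
      task_id: str
      success: bool
      total_tokens: int       # prompt + completion tokens, including retries
      latency_seconds: float  # end-to-end wall-clock time

  def summarize(runs: list[RunResult]) -> dict:
      # Collapse repeated runs of the same task into the four dimensions.
      return {
          "capability": any(r.success for r in runs),                    # succeeded at least once
          "reliability": mean(1.0 if r.success else 0.0 for r in runs),  # fraction of runs that succeed
          "avg_tokens": mean(r.total_tokens for r in runs),              # multiply by a per-token rate for cost
          "avg_latency_s": mean(r.latency_seconds for r in runs),
      }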

Public benchmarks

HELM

What it measures

Holistic Evaluation of Language Models. Multi-metric: accuracy, calibration, robustness, fairness, bias, toxicity, efficiency. Maintained by Stanford CRFM.

How to read it

A scoreboard rather than a single number. Use HELM to compare model behaviour across many axes; it does not measure agentic execution per se.

AgentBench

What it measures

Eight environments testing LLM-as-agent across operating system, database, web shopping, web browsing, and others. Reports a per-environment score and a composite.

How to read it

The composite hides large per-environment variance. Read the per-environment scores to understand whether the agent is good at the kind of task the application needs.

SWE-Bench

What it measures

Can the agent resolve real-world GitHub issues? Each task is a real bug fix from a popular open-source repository. Resolved-rate is the headline metric.

How to read it

Specific to coding agents. The Verified subset is the cleaner reference: it is a human-screened subset that removes tasks from the original SWE-Bench with under-specified issue descriptions or unreliable tests.

GAIA

What it measures

General AI Assistant tasks across three difficulty levels. Tasks require web access, multi-step reasoning, and tool use.

How to read it

Designed so that humans solve almost all tasks easily; agents do not. Use as a stress test for browsing and tool use rather than a leaderboard chase.

ToolBench

What it measures

Tool-calling competence over a corpus of real-world REST APIs. Measures pass-rate and ranking against a baseline.

How to read it

Specific to tool use, not full-agent loops. Useful when the limiting factor on the application is reliable tool selection and argument formation.
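
When tool use is the limiting factor, the unit under test can be the individual call rather than the whole loop: did the agent pick the right tool, and did it form acceptable arguments? A hedged sketch of that check, assuming tool calls are represented as dicts with a name and an arguments mapping (an illustrative format, not the benchmark's own):

  def check_tool_call(predicted: dict, expected: dict) -> bool:
      # Compare one predicted tool call against a gold reference call.
      # Tool selection: the right tool must be chosen.
      if predicted.get("name") != expected["name"]:
          return False
      # Argument formation: every expected argument must be present and match.
      # Real harnesses usually normalise values (types, casing) before comparing.
      predicted_args = predicted.get("arguments", {})
      return all(predicted_args.get(k) == v for k, v in expected["arguments"].items())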

What public benchmarks miss

Public benchmarks measure agents in stable, well-instrumented environments. They cannot easily measure:

  • Reliability over long horizons. Most benchmarks are single-run scoring. Production reliability requires many runs of the same task.
  • Cost discipline. An agent that hits a 70% pass rate using ten times the tokens of an agent at 65% is rarely the better choice. Public benchmark leaderboards usually do not report token cost.
  • Tool-call recovery. The behaviour when a tool returns an unexpected schema is the kind of thing that decides real deployments. Benchmark environments tend to have stable tool surfaces.
  • Human-in-the-loop hand-off. When an agent is uncertain, does it escalate appropriately? Hard to capture in automated scoring.

Application-specific evaluation harnesses (a small representative test set with known-correct outputs, run repeatedly, with cost and latency captured) are the standard supplement to public benchmarks.
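
Such a harness can be small. A sketch of the loop, assuming a run_agent callable that returns an answer plus the tokens it consumed, and an application-supplied grade function that compares the answer to the known-correct output (both hypothetical names):

  import time

  def evaluate(tasks, run_agent, grade, runs_per_task=5):
      # tasks: e.g. [{"id": "t1", "input": ..., "expected": ...}, ...]
      results = {}
      for task in tasks:
          records = []
          for _ in range(runs_per_task):               # repeated runs expose (un)reliability
              start = time.monotonic()
              answer, tokens_used = run_agent(task["input"])
              records.append({
                  "success": grade(answer, task["expected"]),  # known-correct output check
                  "tokens": tokens_used,                       # cost, including any retries
                  "latency_s": time.monotonic() - start,       # end-to-end wall-clock
              })
          results[task["id"]] = records                # per-task records, one list per task
      return results

Run on every significant change, the same small test set gives a trend line for all four dimensions, which is exactly what the public leaderboards cannot provide.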

Glossary

See benchmark, reliability, latency, LLM as judge.
