IBM Research launches Open Agent Leaderboard to benchmark full AI agent systems


5/18/2026, 2:57:09 PM

On May 18, 2026, IBM Research released the Open Agent Leaderboard, an open benchmark that assesses entire AI agent systems, together with the Exgentic evaluation framework. The leaderboard measures performance and operational cost across realistic, multi-step tasks.

IBM Research published the Open Agent Leaderboard on May 18, 2026, introducing an open benchmark that evaluates entire AI agent systems rather than just their underlying models. The release, attributed to researcher Elron Bandel (IBM Research), includes the Exgentic framework to run and reproduce evaluations and an accompanying paper documenting methodology and results; the team says all materials are open from day one. This approach targets the practical question of how well agents work in real deployments and what they cost to operate.

The leaderboard treats agents as full systems, measuring what tools they can call, how they plan multi-step work, how they store and use memory, and how they recover from failures. Each submission receives paired measurements of quality and cost so builders can see not only which approaches perform best but which are financially sensible to deploy. The project frames generality as a spectrum linked to practical capability and expense, rather than a single benchmark score.
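One way to read paired quality and cost measurements is to keep only the submissions that no other submission beats on both axes at once (a Pareto front). The sketch below is illustrative only: the `Result` type, the agent names, and all numbers are made up for the example and are not taken from the leaderboard, and the article does not say that the leaderboard ranks submissions this way.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """Hypothetical leaderboard row: one agent system with paired measurements."""
    agent: str
    quality: float  # e.g. task success rate, higher is better
    cost: float     # e.g. average cost per task, lower is better


def pareto_front(results: List[Result]) -> List[Result]:
    """Return the results that are not dominated by any other result.

    A result is dominated if another one is at least as good on both
    axes and strictly better on at least one of them.
    """
    front = []
    for r in results:
        dominated = any(
            o.quality >= r.quality
            and o.cost <= r.cost
            and (o.quality > r.quality or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            front.append(r)
    # Sort cheapest first so the quality/cost trade-off reads left to right.
    return sorted(front, key=lambda r: r.cost)


# Made-up numbers, for illustration only.
results = [
    Result("A", quality=0.60, cost=1.0),
    Result("B", quality=0.70, cost=3.0),
    Result("C", quality=0.65, cost=4.0),  # worse and pricier than B
    Result("D", quality=0.50, cost=0.5),
]

front_names = [r.agent for r in pareto_front(results)]
print(front_names)
```

In this toy data, agent C drops out because B is both better and cheaper, while D, A and B each represent a different point on the quality/cost trade-off.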

To exercise broad, realistic workloads, the team assembled six benchmarks that probe different task types. The community datasets and scenarios included are: SWE-Bench Verified (fixing real bugs in code repositories); BrowseComp+ (researching complex questions across the web); AppWorld (completing personal tasks across hundreds of apps and actions); tau2-Bench Airline & Retail (customer service under company policies); and tau2-Bench Telecom (technical support under company policies). These selections are intended to expose agents to varied tools, rules and constraints.

A unified protocol gives every benchmark the same structure: a task (what to do), context (what to know) and a set of actions (what's allowed), so agents interact through a single standardized interface. Standardizing required reconciling each benchmark's assumptions and interaction patterns with different agent designs; Exgentic is provided to run those standardized evaluations and reproduce results, and the paper records the methodology and outcomes.
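The task / context / actions structure described above could be modeled roughly as follows. This is a minimal sketch under stated assumptions, not Exgentic's actual API: the `Action` and `TaskSpec` classes, the `act` method, and the toy retail example are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Action:
    """Hypothetical description of one thing the agent is allowed to do."""
    name: str
    description: str
    handler: Callable[..., str]


@dataclass
class TaskSpec:
    """Hypothetical uniform wrapper around a single benchmark task."""
    task: str      # what to do
    context: str   # what to know
    actions: Dict[str, Action] = field(default_factory=dict)  # what's allowed

    def act(self, name: str, **kwargs) -> str:
        """Run an allowed action; reject anything outside the action set."""
        if name not in self.actions:
            raise ValueError(f"action not allowed: {name}")
        return self.actions[name].handler(**kwargs)


# Toy customer-service task; the order data is invented for the example.
spec = TaskSpec(
    task="Tell the customer the status of order A1.",
    context="Policy: only share order status, never issue refunds.",
    actions={
        "lookup_order": Action(
            name="lookup_order",
            description="Look up the status of an order by id.",
            handler=lambda order_id: f"order {order_id}: shipped",
        )
    },
)

print(spec.act("lookup_order", order_id="A1"))
```

Under this sketch, an agent only ever sees a `TaskSpec` and calls `act`, so the same agent code can be pointed at a coding task or a customer-service task without benchmark-specific glue.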

The leaderboard departs from conventional model-only evaluations by making the full agent stack the unit of comparison. IBM Research positions this as a stronger test of “generality” because it evaluates agents across diverse, unfamiliar settings with different tools, rules and constraints, and it reports cost alongside performance. The authors acknowledge the leaderboard does not cover every future agent capability but present it as a more practical cross-setting assessment than earlier benchmarks.

For builders and system architects, the leaderboard exposes which system components (tools, planning, memory and failure recovery) drive results and highlights trade-offs between generality and cost. By reporting both quality and operational expense, the project lets teams compare deployment-worthiness rather than raw capability and iterate on system-level design choices with clearer signals about their impact on performance and cost.

Sources

Hugging Face Blog · 5/18/2026
