
AI evaluation, once treated as a secondary cost next to model training, has rapidly become a primary expenditure and a significant compute bottleneck in the development of advanced AI systems. This shift affects who can participate in cutting-edge AI research and development, as the financial and computational burdens of robust evaluation become prohibitive for many.
Concrete figures underscore this growing challenge. The Holistic Agent Leaderboard (HAL) recently incurred approximately $40,000 for 21,730 agent rollouts across nine distinct models and benchmarks. By April 2026, HAL expanded to 26,597 rollouts, with independent reproductions by Ndzomga costing $46,000 across 242 agent runs. A single GAIA run on a frontier model can cost $2,829 before caching. Exgentic's $22,000 sweep revealed a 33-fold cost spread, isolating scaffold choice as a primary cost driver. In scientific ML, evaluating one new architecture with "The Well" requires 960 H100-hours, escalating to 3,840 H100-hours for a full four-baseline sweep.
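To put these totals on a per-unit footing, the short sketch below divides the quoted figures through. The variable names are illustrative, and the averages are derived here rather than reported by the sources; they hide the large spread between cheap and expensive tasks. The last line assumes the four-baseline sweep is simply four times the single-architecture cost, which is consistent with the numbers quoted.

```python
# Figures quoted in the text; the per-unit averages are derived for illustration only.
hal_cost_usd = 40_000
hal_rollouts = 21_730
repro_cost_usd = 46_000
repro_runs = 242

# Average cost per HAL rollout and per reproduced agent run (different units, not comparable 1:1).
cost_per_rollout = hal_cost_usd / hal_rollouts
cost_per_run = repro_cost_usd / repro_runs

# "The Well": one architecture vs. a four-baseline sweep (assumed to scale linearly).
well_hours_per_architecture = 960
n_baselines = 4
well_sweep_hours = well_hours_per_architecture * n_baselines

print(f"HAL: about ${cost_per_rollout:.2f} per rollout")
print(f"Reproductions: about ${cost_per_run:.2f} per agent run")
print(f"The Well sweep: {well_sweep_hours} H100-hours")
```

On these figures, an average HAL rollout costs under two dollars, while an average reproduced agent run costs roughly a hundred times more, which is one reason a single headline total says little about what a new evaluation will cost.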
This evaluation cost crisis isn't entirely new. Even earlier static LLM benchmarks were expensive: HELM, released by Stanford's CRFM in 2022, cost between $85 and $10,926 per model in API fees, and open models required 540 to 4,200 GPU-hours. IBM Research noted that evaluating Granite-13B through HELM could consume up to 1,000 GPU-hours. The aggregate cost of HELM across 30 models and 42 scenarios was approximately $100,000. Crucially, Perlitz et al. (2023, 2024) observed that for small models, evaluation costs "may even surpass those of pretraining when evaluating checkpoints," making evaluation a significant multiplier on training expenses during development.
Faced with escalating costs for static benchmarks, the AI community developed compression techniques. Perlitz et al. found that a 100x to 200x reduction in compute could preserve nearly the same model ordering within HELM, which led to the Flash-HELM procedure for coarse-to-fine evaluation. Other initiatives, such as tinyBenchmarks, compressed MMLU from 14,000 items to just 100 anchor items with about 2% error. Similarly, the Open LLM Leaderboard was reduced from 29,000 examples to 180, while Anchor Points demonstrated that as few as 1 to 30 examples could effectively rank-order 87 language-model and prompt pairs on GLUE.
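The idea behind these results, that a small subset can preserve model ordering, can be illustrated with a rough sketch. The code below is not Flash-HELM, tinyBenchmarks, or Anchor Points: it uses plain random subsampling as a stand-in, on synthetic data for six hypothetical models, and measures how many model pairs keep the same order when 14,000 items are cut to 100. All names and accuracies are assumptions made for illustration.

```python
import random

def rank_agreement(full_scores, sub_scores):
    """Fraction of model pairs ordered the same way by both score lists.

    Ties in either list count as disagreement.
    """
    n = len(full_scores)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    same = sum(
        1 for i, j in pairs
        if (full_scores[i] - full_scores[j]) * (sub_scores[i] - sub_scores[j]) > 0
    )
    return same / len(pairs)

def subset_accuracy(results, item_ids):
    """Mean correctness per model over the chosen item indices."""
    return [sum(row[k] for k in item_ids) / len(item_ids) for row in results]

rng = random.Random(0)
n_items, n_models = 14_000, 6

# Synthetic per-item correctness for hypothetical models with true accuracy 0.30 to 0.80.
true_accuracy = [0.30 + 0.10 * m for m in range(n_models)]
results = [[1 if rng.random() < p else 0 for _ in range(n_items)] for p in true_accuracy]

full_scores = subset_accuracy(results, range(n_items))
subset = rng.sample(range(n_items), 100)   # random items, not learned anchor items
sub_scores = subset_accuracy(results, subset)

compression = n_items / len(subset)        # 140x fewer items
agreement = rank_agreement(full_scores, sub_scores)
print(f"{compression:.0f}x compression, pairwise order agreement {agreement:.2f}")
```

Random subsampling works best when models are well separated; the published methods choose items more carefully, which is how they hold error low at such small sizes.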
However, the efficacy of these compression tricks has sharply diminished as benchmarks moved from static predictions to agentic systems. Agent evaluations are inherently messier: they are noisy, highly sensitive to scaffold choice, and hard to compress. Benchmarks involving training-in-the-loop are expensive, and repeated runs for reliability further multiply costs. Unlike static benchmarks, agent evaluations rarely measure "the model" in isolation; they evaluate the combined product of the model, its scaffold, and its token budget. This interplay is reflected in HAL's cost variations, where the cost of a single benchmark run can vary by four orders of magnitude across tasks and by three orders of magnitude within some benchmarks.
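The trade-off behind repeated runs can be made concrete with a minimal sketch. It assumes that runs are independent, that each costs the same, and that caching saves nothing; under those assumptions total cost grows linearly with the number of repeats, while the noise in the averaged score shrinks only with the square root. The function name is hypothetical, and the per-run cost reuses the $2,829 GAIA figure quoted earlier.

```python
import math

def repeated_run_budget(cost_per_run_usd, repeats):
    """Return (total cost, noise relative to a single run) for averaged repeats.

    Assumes independent, equally priced runs with no caching savings.
    """
    total_cost = cost_per_run_usd * repeats
    relative_noise = 1 / math.sqrt(repeats)  # standard error of a mean of k runs
    return total_cost, relative_noise

gaia_run_cost_usd = 2_829  # single frontier-model GAIA run, before caching
for k in (1, 4, 9, 16):
    total, noise = repeated_run_budget(gaia_run_cost_usd, k)
    print(f"{k:>2} runs: ${total:,} total, noise at {noise:.0%} of a single run")
```

Halving the noise requires four times the spend, which is why reliability is costly to buy for agent benchmarks.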
The escalating and unpredictable costs of agentic AI evaluation pose a significant barrier to entry, potentially centralizing advanced AI development among well-funded organizations. This bottleneck strains existing resources and risks stifling innovation by making it harder for smaller teams or academic researchers to rigorously test and iterate on novel agent architectures. As evaluation becomes an increasingly dominant line item in the overall development cycle, it shifts resource allocation away from training new models towards the intricate and costly process of assessing their capabilities, especially when reliability demands repeated, expensive runs.