IBM Research and Hugging Face introduce VAKRA: a new benchmark for evaluating AI agents in enterprise environments

4/24/2026, 3:40:47 PM

IBM Research and Hugging Face have jointly introduced VAKRA, a benchmark for evaluating how well AI agents reason and use tools in realistic enterprise settings. Published on April 15, 2026, VAKRA measures not isolated skills but the reliability of executing complex multi-step workflows that require logical inference and interaction with diverse tools. To that end, the benchmark provides an executable environment with over 8,000 locally hosted APIs, backed by real databases from 62 domains and by extensive domain-specific document collections. VAKRA tasks require reasoning chains of 3 to 7 steps that combine structured API interaction with search over unstructured data.

Unlike traditional benchmarks, which typically test individual functions, VAKRA evaluates end-to-end reliability by examining complete execution traces. Current analysis shows that existing AI models perform relatively poorly on VAKRA, which underscores the benchmark's difficulty and the need for further research on AI agents. For developers, the benchmark helps identify the real limitations of agents under conditions that closely approximate an enterprise environment. The detailed dataset and the analysis of failure types that accompany VAKRA provide a foundation for designing agents that work more effectively with tools and handle multi-step queries.
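
To make the difference between testing individual functions and scoring complete execution traces concrete, here is a minimal Python sketch. It is not VAKRA's actual scoring code, which the announcement does not describe. The `Step` record, both metric functions, the toy traces, and every tool name other than `get_data` are assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One tool call in an execution trace (hypothetical record format)."""
    tool: str
    correct: bool


def step_accuracy(traces: list[list[Step]]) -> float:
    """Fraction of individual tool calls that were correct, pooled over all traces."""
    steps = [step for trace in traces for step in trace]
    return sum(step.correct for step in steps) / len(steps)


def end_to_end_success(traces: list[list[Step]]) -> float:
    """Fraction of traces in which every single step was correct."""
    return sum(all(step.correct for step in trace) for trace in traces) / len(traces)


# Toy traces made up for this example; they are not benchmark results.
traces = [
    [Step("get_data", True), Step("filter", True), Step("aggregate", True)],
    [Step("get_data", True), Step("filter", False), Step("aggregate", True)],
    [Step("get_data", True), Step("search_docs", True),
     Step("filter", True), Step("aggregate", False)],
]

print(f"per-step accuracy:  {step_accuracy(traces):.2f}")
print(f"end-to-end success: {end_to_end_success(traces):.2f}")
```

In this toy input, 8 of 10 individual calls are correct, yet only 1 of 3 traces completes without an error. A per-function test would miss that gap, while a trace-level measure exposes it.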

VAKRA is structured around four main tasks, each testing a different aspect of agent capability. One example is the task "API Chaining using Business Intelligence APIs", which contains 2,077 test scenarios across 54 domains. Each scenario requires between 1 and 12 sequential tool calls to resolve and is linked to its own JSON data source, which is initialized through a dedicated tool, `get_data(tool_universe_id=id)`. This call returns a lightweight preview of the data while the full dataset remains on the server, which avoids transferring large volumes of data over the Model Context Protocol (MCP).

An important element of VAKRA is its pair of significantly expanded tool collections, SLOT-BIRD and SEL-BIRD. The SLOT-BIRD collection offers 7 general-purpose tools for data manipulation, such as filtering functions. SEL-BIRD complements it with specialized capabilities, replacing the general `retrieve_data` function with more specific, query-oriented retrieval functions.
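
The contrast between the two tool styles can be illustrated with a small Python sketch. Only the name `retrieve_data` appears in the announcement, and its signature here is an assumption. The toy `ORDERS` table, `filter_rows`, and `get_orders_by_region` are hypothetical stand-ins, not actual SLOT-BIRD or SEL-BIRD tools.

```python
from typing import Any

Row = dict[str, Any]

# Toy table invented for this illustration.
ORDERS: list[Row] = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 80.0},
    {"order_id": 3, "region": "EMEA", "amount": 45.5},
]


# General-purpose style: a generic retrieval tool plus a generic filter,
# which the agent must chain with the right arguments.
def retrieve_data(table: list[Row]) -> list[Row]:
    """Return all rows of a table (assumed signature)."""
    return list(table)


def filter_rows(rows: list[Row], column: str, value: Any) -> list[Row]:
    """Keep rows whose `column` equals `value` (hypothetical general tool)."""
    return [row for row in rows if row.get(column) == value]


# Specialized style: one query-oriented function replaces the generic retrieval.
def get_orders_by_region(region: str) -> list[Row]:
    """Return orders for one region (hypothetical specialized tool)."""
    return [row for row in ORDERS if row["region"] == region]


general = filter_rows(retrieve_data(ORDERS), "region", "EMEA")
specialized = get_orders_by_region("EMEA")
print(general == specialized)
```

Both routes return the same rows here. The trade-off the sketch is meant to show is that the general style needs fewer tools but longer call chains, while the specialized style needs shorter chains but forces the agent to pick the right function from a larger set.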

Sources

Hugging Face Blog · 4/15/2026
