How many annotators are needed for AI: Google Research proposes a new standard for creating benchmarks

4/25/2026, 3:11:04 PM

On March 31, 2026, Google Research researchers Flip Korn and Chris Welty introduced a new open-source analytical system aimed at a pressing problem in modern machine learning: the reproducibility crisis. In the artificial intelligence industry, reproducibility describes how easily other teams can repeat an experiment with the same data and settings and obtain the same results. The main difficulty is that the reference data used to test models relies on human evaluations, and people, unlike machines, approach problems from different perspectives and often disagree with each other.

In their paper, the researchers describe this problem as a trade-off between "the forest and the trees": a choice between the total number of items being evaluated and the number of annotators assigned to each individual item. To find the optimal balance and make the most efficient use of research budgets, they developed a simulator, which is now available to the entire developer community on GitHub. The tool made it possible to run a large-scale stress test that varied two main parameters. The first is scale, the total number of items, ranging from a modest budget of 100 to large sets of 50,000. The second is crowd, the number of people evaluating a single item, ranging from 1 to 500.
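The trade-off can be illustrated with a toy simulation. The sketch below is not the authors' simulator: the function names, the Beta(2, 2) model of annotator disagreement, and the choice of metric (mean share of positive votes) are all assumptions made for illustration. It splits a fixed annotation budget into different (items, annotators per item) combinations and measures how much the metric varies between independent repetitions of the same experiment.

```python
import random
import statistics


def allocations(budget, crowd_sizes):
    """Split a fixed annotation budget into (items, annotators_per_item) pairs."""
    return [(budget // k, k) for k in crowd_sizes if k <= budget]


def run_once(n_items, k_raters, rng):
    """Simulate one annotation round; return the mean share of positive votes."""
    shares = []
    for _ in range(n_items):
        # Assumed model: each item has its own latent rate of positive votes,
        # so annotators genuinely disagree on items with a rate near 0.5.
        p = rng.betavariate(2, 2)
        votes = sum(rng.random() < p for _ in range(k_raters))
        shares.append(votes / k_raters)
    return statistics.fmean(shares)


def spread(n_items, k_raters, trials=50, seed=0):
    """Standard deviation of the metric across independent repetitions."""
    rng = random.Random(seed)
    results = [run_once(n_items, k_raters, rng) for _ in range(trials)]
    return statistics.pstdev(results)


if __name__ == "__main__":
    # Hypothetical budget of 1,000 annotations in total.
    for n_items, k_raters in allocations(1000, [1, 5, 10, 50]):
        print(n_items, k_raters, round(spread(n_items, k_raters), 4))
```

A smaller spread means the result is easier to reproduce. Which split comes out ahead depends on the metric and on how disagreement is modeled, so the numbers from this toy model should not be read as the study's results.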

To verify that the simulator works on real subjective tasks, the Google Research team used several large and diverse datasets. The first was a toxicity assessment dataset comprising 107,620 social media comments annotated by 17,280 people. The second was the DICES dataset, designed for evaluating the safety of conversational AI: it consists of 350 conversations with chatbots, which 123 annotators analyzed across 16 safety parameters. The third, a cross-cultural dataset named D3code, contained 4,554 items assessed for offensiveness by 4,309 annotators from 21 countries, balanced by gender and age.

Using such varied data allowed the researchers to examine how the system behaves when evaluating complex or highly imbalanced data. Specifically, they simulated situations of extreme imbalance, where 99 percent of the data consists of noise or spam and only 1 percent is actually relevant for model training. They also studied how increasing the number of evaluation categories affects annotator behavior, for example when labels are divided into "neutral," "slightly offensive," and "explicitly toxic."
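A short calculation shows why the 99-to-1 imbalance matters at small scale. The sketch below assumes items are sampled independently and that exactly 1 percent of the underlying data is relevant; the function names are hypothetical and are not taken from the published simulator.

```python
def expected_positives(n_items, prevalence):
    """Expected number of relevant items in a random sample of n_items."""
    return n_items * prevalence


def prob_no_positive(n_items, prevalence):
    """Probability that a random sample contains no relevant item at all,
    assuming items are drawn independently."""
    return (1.0 - prevalence) ** n_items


if __name__ == "__main__":
    # Sample sizes taken from the scale range mentioned in the article.
    for n_items in (100, 1000, 50000):
        print(
            n_items,
            expected_positives(n_items, 0.01),
            prob_no_positive(n_items, 0.01),
        )
```

With 100 items and 1 percent prevalence, the sample is expected to contain a single relevant item, and under these assumptions there is roughly a 37 percent chance that it contains none.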

A key finding of the study, one that challenges the status quo in machine learning evaluation, is that there is no universal approach to annotation. The accepted standard of several annotators per item is insufficient for capturing natural human disagreement, especially when identifying hate speech or analyzing subtle cultural biases. The framework proposed by the Google Research team gives developers a clear roadmap for creating more reliable, cost-effective, and easily reproducible tests for artificial intelligence. Rather than filtering out subjectivity as an annoying error, such tests treat the diversity of opinions as the most valuable signal.

Sources

Google Research topic stream · 3/31/2026
