
On May 6, 2026, maintainers of the Open ASR Leaderboard announced that newly curated English speech test sets contributed by Appen Inc. and DataoceanAI will be hosted privately rather than published. The change is intended to reduce test‑set contamination and discourage “benchmaxxing” — the practice of tuning models specifically to perform well on a benchmark. Participants may enable the private datasets via a toggle in the leaderboard UI to observe their effect, but the default Average WER metric will remain calculated only on public datasets.
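The toggle behavior described above can be sketched in a few lines. This is a rough illustration only: it assumes the Average WER is an unweighted mean of per-dataset WER values, which the announcement does not spell out, and the function name, dataset names, and scores are hypothetical placeholders rather than leaderboard code or results.

```python
from statistics import mean


def average_wer(per_dataset_wer, private_names, include_private=False):
    """Unweighted mean of per-dataset WER (assumed aggregation).

    Private datasets are counted only when include_private is True,
    mirroring the UI toggle; the default uses public datasets only.
    """
    selected = [
        wer
        for name, wer in per_dataset_wer.items()
        if include_private or name not in private_names
    ]
    if not selected:
        raise ValueError("no datasets selected")
    return mean(selected)


# Placeholder numbers for illustration only, not leaderboard results.
scores = {"public_a": 4.0, "public_b": 6.0, "private_c": 11.0}
private = {"private_c"}

default_avg = average_wer(scores, private)  # public datasets only
toggled_avg = average_wer(scores, private, include_private=True)  # public + private
```

With these placeholder values, a model that looks strong on public data alone would show a visibly higher average once the private split is switched on, which is the kind of gap the toggle is meant to expose.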
The private additions include both scripted and conversational material across multiple accents, organized into short per‑locale splits. Appen’s scripted splits, all read speech with punctuation and casing, are: Scripted AU (Australian), 1.42 h, 49% male/51% female; Scripted CA (Canadian), 1.53 h, 52% male/48% female; Scripted IN (Indian), 1.02 h, 49% male/51% female; and Scripted US (American), 1.45 h, 49% male/51% female. Conversational material from Appen includes Conversational IN at 1.37 h and US conversational splits of roughly 1.64–1.65 h each that include disfluencies and punctuation.
DataoceanAI contributed a Scripted US split of 2.43 h (54% male/46% female), described as punctuated and cased and containing proper nouns and disfluencies.
The maintainers framed the move with reference to Goodhart’s Law (when a measure becomes a target, it ceases to be a good measure), warning that open benchmarks are vulnerable to models tuned to leaderboard specifics without a matching gain in real‑world robustness. Since its launch in September 2023, the leaderboard has registered over 710,000 visits, which the team said underscores both the demand for common ASR measures and the incentive to overfit public tests.
To preserve transparency while reducing contamination risks, the project will keep the evaluation code and the UI open‑source, and the UI toggle will continue to let users see how the private sets affect results. The maintainers said they will continue to incorporate high‑quality datasets and new evaluation settings aimed at better reflecting diverse, real‑world performance and at strengthening resistance to benchmark‑specific optimization.
Scoring and comparison are standardized: all test splits are consolidated into a single dataset on the Hub and model outputs are normalized before scoring. The normalizer removes punctuation and casing, maps to American spelling, and is based on the Whisper normalizer. Open evaluation scripts and a shared Hub dataset structure are intended to reduce variability from differing output conventions across models and datasets.
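To make the normalize-then-score step concrete, here is a minimal sketch. It is not the leaderboard's evaluation script or the Whisper normalizer: the `normalize` and `wer` functions and the three-entry spelling map are hypothetical stand-ins that cover only the behaviors named above (removing punctuation and casing, mapping to American spelling), and the actual normalizer likely handles more cases than shown.

```python
import re

# Tiny illustrative British-to-American map (assumption); a real
# normalizer would rely on a much larger spelling list.
SPELLING = {"colour": "color", "normalise": "normalize", "centre": "center"}


def normalize(text):
    """Lowercase, strip punctuation, and map spellings to American forms."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation, keep apostrophes
    words = [SPELLING.get(word, word) for word in text.split()]
    return " ".join(words)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Differences in casing, punctuation, and spelling do not count as errors.
same = wer("The colour is red.", "the color is red")
# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
two_errors = wer("the cat sat on the mat", "the cat sit on mat")
```

Normalizing both the reference and the hypothesis with the same function is what keeps models with different output conventions comparable: a model that emits cased, punctuated text is not penalized against one that emits lowercase text.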