
A 64‑member consortium published SOOHAK, a 439‑problem, handwritten benchmark split into a 340‑problem research‑level Challenge and a 99‑problem Refusal set to evaluate LLMs’ deep reasoning and ability to flag unsolvable or flawed math problems.
A consortium of 64 mathematicians has released SOOHAK, a 439‑problem benchmark of original, handwritten mathematics designed to probe research‑level reasoning and a model’s ability to refuse or flag unsolvable problems. The authors built it because models now score so well on existing competition exams that those exams no longer stress them; SOOHAK instead targets novel, niche research questions that expose gaps in current systems.
SOOHAK is divided into two deliberately different components: a 340‑problem Challenge set of graduate and research‑level questions, and a 99‑problem Refusal set composed of tasks that failed quality control due to missing assumptions or internal contradictions. Contributors included 38 professors, 25 PhD students and postdocs, and five IMO medalists; each attested that their problems were written without AI assistance. Submissions passed through automated LLM checks, manual moderation and iterative revisions before inclusion.
Testing on the Challenge set shows that frontier proprietary models lead but still leave most problems unsolved. Google’s Gemini 3 Pro scored roughly 30 percent, GPT‑5 (versions 5.1/5.2) about 26 percent, and Anthropic’s Claude Opus 4.5 around 10 percent. Several open‑weight models, namely Kimi‑2.5, Qwen3 (235B) and GPT‑OSS‑120B, remained below 15 percent. The authors note that 124 of the 340 Challenge tasks, roughly 36 percent, were not solved by any tested model. By contrast, SOOHAK‑Mini, a companion collection aimed at olympiad and early college level, produced much higher and more clustered scores.
The Refusal set enforces a different behavior: models receive credit only when they correctly identify and name a flaw rather than outputting a solution. No model exceeded the 50 percent mark on this set. The open‑weight GLM‑5 performed best at just under 50 percent, ahead of GPT‑5 and Gemini 3 Pro, while the Qwen3 family fell below 3 percent and almost always failed to flag broken problems. Adjusted rankings in the report emphasize that strong problem‑solving accuracy does not guarantee reliable refusal behavior.
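The all-or-nothing credit rule on the Refusal set can be sketched in a few lines of Python. This is a minimal illustration, not the authors' grader: the `Response` record, the flaw labels and the exact-match comparison are all assumptions made for the example, and the report may judge a named flaw differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Response:
    """Hypothetical record of one model answer to a Refusal-set problem."""
    flagged: bool               # True if the model said the problem is flawed
    named_flaw: Optional[str]   # flaw the model named (labels are invented here)


def earns_credit(response: Response, true_flaw: str) -> bool:
    # Credit requires both flagging the problem and naming the right flaw.
    # Outputting a "solution" (flagged=False) never earns credit.
    return response.flagged and response.named_flaw == true_flaw


def refusal_accuracy(responses: Sequence[Response], true_flaws: Sequence[str]) -> float:
    # Fraction of broken problems that were both flagged and correctly diagnosed.
    if len(responses) != len(true_flaws):
        raise ValueError("responses and true_flaws must have the same length")
    if not responses:
        return 0.0
    credited = sum(earns_credit(r, f) for r, f in zip(responses, true_flaws))
    return credited / len(responses)


# Toy data, invented for illustration only.
toy_flaws = ["missing_assumption", "contradiction", "contradiction", "missing_assumption"]
toy_responses = [
    Response(flagged=True, named_flaw="missing_assumption"),  # earns credit
    Response(flagged=False, named_flaw=None),                 # answered anyway
    Response(flagged=True, named_flaw="missing_assumption"),  # wrong diagnosis
    Response(flagged=True, named_flaw="contradiction"),       # wrong diagnosis
]
toy_score = refusal_accuracy(toy_responses, toy_flaws)
print(f"refusal accuracy: {toy_score:.0%}")
```

Under a rule like this, a model that always attempts a solution scores zero on the set regardless of how strong its mathematics is, which is consistent with the near-zero results reported for the Qwen3 family.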
The authors observed divergent scaling trends: solution rates on the Challenge set rise nearly linearly with model size and with longer reasoning budgets, but refusal accuracy does not follow the same pattern. They also report that open‑weight systems appear to transfer less effectively to unpublished, niche research material, a factor that helps explain the wider performance gap at the research level compared with SOOHAK‑Mini.
A human baseline involved 25 participants across five groups, including IMO medalists and PhD mathematicians, who collectively solved 51 percent of a 79‑task subset; Gemini 3 Pro exceeded that combined human coverage, reaching 61 percent on the same subset. The study further found that participants with Olympiad experience outperformed PhD researchers. For model builders, SOOHAK surfaces a concrete optimization target: current training and scaling strategies improve raw solution rates but do not reliably teach models to admit when a problem has no valid answer.
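The combined human score reads as a union measure: a task counts as solved if at least one group solved it. The Python sketch below shows that computation under this assumption. The group names and task IDs are invented for illustration; only the 79‑task subset size comes from the report.

```python
from typing import Dict, Set

SUBSET_SIZE = 79  # size of the human-baseline subset, as reported


def collective_coverage(solved_by_group: Dict[str, Set[int]], total_tasks: int) -> float:
    # A task counts as solved if at least one group solved it (set union).
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    solved_by_anyone = set().union(*solved_by_group.values())
    return len(solved_by_anyone) / total_tasks


# Invented task IDs for three hypothetical groups; not data from the study.
toy_groups = {
    "group_a": {1, 2, 3, 4},
    "group_b": {3, 4, 5},
    "group_c": {10, 11},
}
toy_coverage = collective_coverage(toy_groups, SUBSET_SIZE)
print(f"collective coverage: {toy_coverage:.1%}")
```

On the reported figures, 51 percent of 79 corresponds to roughly 40 tasks solved by at least one human group, and 61 percent to roughly 48 tasks for Gemini 3 Pro.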