
A recent benchmark roundup put Claude Code at 87.6% on SWE‑bench Verified while OpenAI reports GPT‑5.5 leading Terminal‑Bench at 82.7%, but those headline numbers now sit alongside a major audit that changes how the Verified metric should be read. The OpenAI Frontier Evals disclosure on Feb 23, 2026 flagged systemic issues in SWE‑bench Verified that affect how vendors and teams interpret comparative scores. This matters for builders and ML engineers deciding which agents to trust for production code tasks.
OpenAI’s Frontier Evals team audited SWE‑bench Verified by examining 138 of the hardest problems across 64 independent runs and concluded that 59.4% of those cases contained fundamentally flawed or unsolvable test cases. The audit cited examples such as checks for exact function names never mentioned in prompts and unrelated behavior copied from upstream pull requests. Auditors also found that GPT‑5.2, Claude Opus 4.5 and Gemini 3 Flash could reproduce gold‑patch solutions verbatim from just the task ID, prompting OpenAI to stop reporting Verified scores and to recommend SWE‑bench Pro instead.
SWE‑bench Pro is presented as a more robust replacement and contains 1,865 tasks split into a 731‑task public set, an 858‑task held‑out set and a 276‑task commercial/private set drawn from 18 proprietary startup codebases. Early unified SWE‑Agent scaffold results cited in a Scale AI paper put frontier model top scores below 25%, with GPT‑5 at 23.3%. Vendor‑reported figures are considerably higher: OpenAI lists GPT‑5.5 at 58.6% (Public) and Anthropic lists Claude Opus 4.7 at 64.3%, while Gemini 3.1 Pro is reported at 54%. These figures are not directly comparable without scaffold and split context.
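As a quick arithmetic check on the split sizes quoted above, the following Python sketch confirms that the three sets sum to 1,865 and computes each split's share of the total. The dictionary and function names are illustrative assumptions and are not part of any SWE‑bench tooling.

```python
# Split sizes as quoted in the text; names are illustrative only.
SWE_BENCH_PRO_SPLITS = {"public": 731, "held_out": 858, "commercial": 276}


def split_shares(splits):
    """Return each split's share of the total task count, in percent."""
    total = sum(splits.values())
    return {name: round(100 * count / total, 1) for name, count in splits.items()}


total_tasks = sum(SWE_BENCH_PRO_SPLITS.values())
shares = split_shares(SWE_BENCH_PRO_SPLITS)
print(total_tasks, shares)
```

The public set works out to under 40% of all tasks, so a score labelled "Public" covers a minority of the benchmark.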
The coding‑agent market has advanced from inline autocomplete to systems that read GitHub issues, navigate multi‑file repositories, write fixes, execute tests and open pull requests. By early 2026 roughly 85% of developers reported regularly using some form of AI assistance. The field has fragmented into distinct archetypes — terminal agents, AI‑native IDEs, cloud‑hosted autonomous engineers and open‑source frameworks that let teams swap models and harnesses — complicating apples‑to‑apples comparisons.
For practitioners the practical takeaway is straightforward: SWE‑bench Verified scores can still serve as directional signals but require explicit caveats about contamination, test quality and model memorization. SWE‑bench Pro is the recommended successor, but its public/held‑out/commercial splits, scaffold variants and optimized harnesses make cross‑run comparisons fragile. Teams should demand disclosure of scaffold, split and harness details and validate candidate agents using hard held‑out, end‑to‑end tests before adopting them in production workflows.
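One way to make the disclosure requirement concrete is to refuse to compare two reported scores unless benchmark, split, scaffold and harness are all disclosed and identical. The sketch below is a minimal illustration under that assumption. The `ReportedScore` record and the `comparability_issues` helper are hypothetical names rather than any vendor's or benchmark's API, and the two example scores are the vendor figures quoted earlier, with undisclosed details left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReportedScore:
    """A benchmark score plus the run details needed to interpret it (hypothetical record)."""
    model: str
    benchmark: str
    score_pct: float
    split: Optional[str] = None      # e.g. "public", "held_out", "commercial"
    scaffold: Optional[str] = None   # agent scaffold used for the run
    harness: Optional[str] = None    # evaluation harness used for the run


def comparability_issues(a: ReportedScore, b: ReportedScore) -> List[str]:
    """List the reasons two scores should not be compared head to head."""
    issues = []
    if a.benchmark != b.benchmark:
        issues.append("different benchmarks")
    for detail in ("split", "scaffold", "harness"):
        va, vb = getattr(a, detail), getattr(b, detail)
        if va is None or vb is None:
            issues.append(f"{detail} undisclosed")
        elif va != vb:
            issues.append(f"{detail} differs: {va!r} vs {vb!r}")
    return issues


# Figures as quoted in the text; details not stated there stay None.
gpt = ReportedScore("GPT-5.5", "SWE-bench Pro", 58.6, split="public")
claude = ReportedScore("Claude Opus 4.7", "SWE-bench Pro", 64.3)

issues = comparability_issues(gpt, claude)
print(issues or "comparable")
```

An empty list means the two runs share every disclosed detail. Anything else is a reason to treat the gap between the scores as directional at best.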