In brief
- OpenAI argues that SWE-bench Verified no longer reflects genuine coding ability because the benchmark is likely contaminated.
- It is now pushing SWE-bench Pro as a harder replacement.
- Scores plunged from ~70% to ~23% on the newer benchmark.
The number that every major AI lab has been using to claim coding supremacy was just declared worthless.
OpenAI published a post today arguing that SWE-bench Verified, the go-to benchmark for measuring AI coding ability, is so riddled with flawed tests and training-data leakage that it no longer tells you anything useful about whether a model can actually write software.
The benchmark works like this: give an AI a real GitHub issue from a popular open-source Python project, ask it to fix the bug without seeing the tests, and check whether its patch makes the failing tests pass without breaking anything else.
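That scoring rule can be sketched in a few lines. This is an illustrative simplification, not the real harness (which applies a git diff inside Docker and runs each project's own test suite); `FakeRepo` and its methods are invented stand-ins.

```python
# Minimal sketch of a SWE-bench-style scoring rule. FakeRepo is a
# stand-in: the real harness checks out the repo at the buggy commit,
# applies the model's diff, and runs the project's actual tests.

class FakeRepo:
    def __init__(self):
        self.patched = False

    def apply(self, patch):
        self.patched = True  # pretend the model's diff applied cleanly

    def run_tests(self, tests):
        # Before the patch, the bug makes "test_bug" fail; afterwards all pass.
        return {t: "PASS" if self.patched or t != "test_bug" else "FAIL"
                for t in tests}

def evaluate(repo, patch, fail_to_pass, pass_to_pass):
    """Resolved = every previously failing test now passes AND
    no previously passing test regresses."""
    repo.apply(patch)
    results = repo.run_tests(fail_to_pass + pass_to_pass)
    return (all(results[t] == "PASS" for t in fail_to_pass)
            and all(results[t] == "PASS" for t in pass_to_pass))

print(evaluate(FakeRepo(), "diff --git ...", ["test_bug"], ["test_ok"]))  # True
```

The key point is the two-sided check: a patch only counts if it both fixes the reported failures and leaves the rest of the suite green.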
OpenAI created SWE-bench Verified in August 2024 as a cleaner version of the original 2023 benchmark, hiring 93 software engineers to filter out tasks that were impossible or poorly specified.
The cleanup worked well enough that every major lab began citing scores on it as proof of progress. When Anthropic released Claude Opus 4 in May 2025, Decrypt reported that the model scored 72.5% on SWE-bench Verified, beating GPT-4.1's 54.6% and Gemini 2.5 Pro's 63.2%. It was the coding benchmark that mattered.
Since then, AI labs from America to China have touted SWE-bench performance to claim the throne as the best model for coding.
Now OpenAI says that race was partly a mirage. According to the report, the team investigated 138 tasks that GPT-5.2 consistently failed across 64 independent runs, and had six engineers review each one. It concluded that 59.4% of those tasks are broken.
About 35.5% have tests so narrowly written that they require a specific function name never mentioned in the issue description. Another 18.8% check for features that weren't part of the original issue at all, pulled in from unrelated pull requests.
The contamination problem roughly works like this: SWE-bench draws its problems from open-source repositories that most AI companies crawl when building training sets. OpenAI tested whether GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview had seen the benchmark's solutions during training. All three had.
Given only a task ID and a brief hint, each model could reproduce the exact code fix from memory, including variable names and inline comments that appear nowhere in the issue description. In one case, GPT-5.2's chain-of-thought logs showed it reasoning that a particular parameter must have been "added around Django 4.1", a detail found only in Django's release notes, not the task description. It was answering a question it had already seen the answer to.
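A memorization probe of this kind can be approximated by checking whether a model's output reproduces distinctive identifiers from the gold patch that never appear in the issue text. The sketch below is an assumption about how such a check might look, not OpenAI's actual methodology; the threshold and token heuristic are invented.

```python
import re

def distinctive_tokens(gold_patch, issue_text):
    """Identifiers (4+ chars) that appear in the gold fix
    but nowhere in the issue description."""
    toks = set(re.findall(r"[A-Za-z_]\w{3,}", gold_patch))
    return {t for t in toks if t not in issue_text}

def looks_memorized(model_output, gold_patch, issue_text, threshold=0.5):
    """Flag output that reproduces most of the gold patch's
    distinctive tokens despite never having seen them in the prompt."""
    marks = distinctive_tokens(gold_patch, issue_text)
    if not marks:
        return False  # nothing distinctive to test against
    hits = sum(1 for t in marks if t in model_output)
    return hits / len(marks) >= threshold

# Toy data (invented): the fix contains names absent from the issue text.
gold = "kwargs.setdefault('db_collation', None)  # added around Django 4.1"
issue = "Migration autodetector crashes when a field's collation changes"
print(looks_memorized(gold, gold, issue))                       # True
print(looks_memorized("I cannot recall this task", gold, issue))  # False
```

The signal is the same one OpenAI describes: variable names and comments that could not have been inferred from the prompt alone.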
OpenAI now recommends SWE-bench Pro, a newer benchmark from Scale AI that uses more diverse codebases and licenses that reduce training-data exposure. The performance drop is stark: models that cleared 70% on the old Verified benchmark score around 23% on SWE-bench Pro's public split, and even less on its private tasks.
On the current public SWE-bench Verified leaderboard, OpenAI is far from the podium. Retiring a benchmark where you're losing and backing one where everyone starts at 23% resets the scoreboard at a convenient moment and makes rivals' claims less impressive.
This is especially notable given that the much-anticipated new version of DeepSeek is reported to beat, or come remarkably close to, American AI models, particularly on agentic and coding tasks, with a free, open-source model. That model may be days away from release, and SWE-bench Verified would have been a key metric for gauging its quality.
OpenAI said it is building privately authored evaluations that won't be released before testing, pointing to its GDPval project, where domain experts write original tasks graded by expert human reviewers.
The benchmark problem is not new, nor unique to coding. AI labs have cycled through numerous evaluations, each useful until models were trained on them or the tasks proved too narrow.
But what makes this case notable is that OpenAI hyped SWE-bench Verified, promoted it across model releases, and is now publicly documenting how thoroughly it has failed, including by showing its own model cheating on it.
