In brief
- OpenAI argues that SWE-bench Verified no longer reflects genuine coding ability because the benchmark is likely contaminated.
- It is now pushing SWE-bench Pro as a harder replacement.
- Scores plunged from ~70% to ~23% on the newer benchmark.
The number that every major AI lab has been using to claim coding supremacy was just declared worthless.
OpenAI published a post today arguing that SWE-bench Verified, the go-to benchmark for measuring AI coding ability, is so riddled with flawed tests and training-data leakage that it no longer tells you anything useful about whether a model can actually write software.
The benchmark works like this: give an AI a real GitHub issue from a popular open-source Python project, ask it to fix the bug without seeing the tests, and check whether its patch makes the failing tests pass without breaking anything else.
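That scoring rule can be sketched in a few lines. This is an illustrative simplification, not the real harness (which applies a git diff inside Docker and runs each project's own test suite); `FakeRepo` and its methods are invented stand-ins.

```python
# Minimal sketch of a SWE-bench-style scoring rule. FakeRepo is a
# stand-in: the real harness checks out the repo at the buggy commit,
# applies the model's diff, and runs the project's actual tests.

class FakeRepo:
    def __init__(self):
        self.patched = False

    def apply(self, patch):
        self.patched = True  # pretend the model's diff applied cleanly

    def run_tests(self, tests):
        # Before the patch, the bug makes "test_bug" fail; afterwards all pass.
        return {t: "PASS" if self.patched or t != "test_bug" else "FAIL"
                for t in tests}

def evaluate(repo, patch, fail_to_pass, pass_to_pass):
    """Resolved = every previously failing test now passes AND
    no previously passing test regresses."""
    repo.apply(patch)
    results = repo.run_tests(fail_to_pass + pass_to_pass)
    return (all(results[t] == "PASS" for t in fail_to_pass)
            and all(results[t] == "PASS" for t in pass_to_pass))

print(evaluate(FakeRepo(), "diff --git ...", ["test_bug"], ["test_ok"]))  # True
```

The key point is the two-sided check: a patch only counts if it both fixes the reported failures and leaves the rest of the suite green.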
OpenAI created SWE-bench Verified in August 2024 as a cleaner version of the original 2023 benchmark, hiring 93 software engineers to filter out tasks that were impossible or poorly specified.
The cleanup worked well enough that every major lab began citing scores on it as proof of progress. When Anthropic released Claude Opus 4 in May 2025, Decrypt reported that the model scored 72.5% on SWE-bench Verified, beating GPT-4.1's 54.6% and Gemini 2.5 Pro's 63.2%. It was the coding benchmark that mattered.
Since then, AI labs from America to China have touted SWE-bench performance to claim the throne as the best model for coding.
Now OpenAI says that race was partly a mirage. According to the report, the team investigated 138 tasks that GPT-5.2 consistently failed across 64 independent runs, and had six engineers review each one. It concluded that 59.4% of those tasks are broken.
About 35.5% have tests so narrowly written that they require a specific function name never mentioned in the issue description. Another 18.8% check for features that weren't part of the original issue at all, pulled in from unrelated pull requests.
The contamination problem roughly works like this: SWE-bench draws its problems from open-source repositories that most AI companies crawl when building training sets. OpenAI tested whether GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview had seen the benchmark's solutions during training. All three had.
Given only a task ID and a brief hint, each model could reproduce the exact code fix from memory, including variable names and inline comments that appear nowhere in the issue description. In one case, GPT-5.2's chain-of-thought logs showed it reasoning that a particular parameter must have been "added around Django 4.1", a detail found only in Django's release notes, not the task description. It was answering a question it had already seen the answer to.
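A memorization probe of this kind can be approximated by checking whether a model's output reproduces distinctive identifiers from the gold patch that never appear in the issue text. The sketch below is an assumption about how such a check might look, not OpenAI's actual methodology; the threshold and token heuristic are invented.

```python
import re

def distinctive_tokens(gold_patch, issue_text):
    """Identifiers (4+ chars) that appear in the gold fix
    but nowhere in the issue description."""
    toks = set(re.findall(r"[A-Za-z_]\w{3,}", gold_patch))
    return {t for t in toks if t not in issue_text}

def looks_memorized(model_output, gold_patch, issue_text, threshold=0.5):
    """Flag output that reproduces most of the gold patch's
    distinctive tokens despite never having seen them in the prompt."""
    marks = distinctive_tokens(gold_patch, issue_text)
    if not marks:
        return False  # nothing distinctive to test against
    hits = sum(1 for t in marks if t in model_output)
    return hits / len(marks) >= threshold

# Toy data (invented): the fix contains names absent from the issue text.
gold = "kwargs.setdefault('db_collation', None)  # added around Django 4.1"
issue = "Migration autodetector crashes when a field's collation changes"
print(looks_memorized(gold, gold, issue))                       # True
print(looks_memorized("I cannot recall this task", gold, issue))  # False
```

The signal is the same one OpenAI describes: variable names and comments that could not have been inferred from the prompt alone.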
OpenAI now recommends SWE-bench Pro, a newer benchmark from Scale AI that uses more diverse codebases and licenses that reduce training-data exposure. The performance drop is stark: models that cleared 70% on the old Verified benchmark score around 23% on SWE-bench Pro's public split, and even less on its private tasks.
On the current public SWE-bench Verified leaderboard, OpenAI is far from the podium. Retiring a benchmark where you're losing and backing one where everyone starts at 23% resets the scoreboard at a convenient moment and makes rivals' claims less impressive.
This is especially notable given that the much-anticipated new version of DeepSeek is reported to beat, or come remarkably close to, American AI models, particularly on agentic and coding tasks, with a free, open-source model. That model may be days away from release, and SWE-bench Verified would have been a key metric for gauging its quality.
OpenAI said it is building privately authored evaluations that won't be released before testing, pointing to its GDPval project, where domain experts write original tasks graded by expert human reviewers.
The benchmark problem is not new, nor unique to coding. AI labs have cycled through numerous evaluations, each useful until models were trained on them or the tasks proved too narrow.
But what makes this case notable is that OpenAI hyped SWE-bench Verified, promoted it across model releases, and is now publicly documenting how thoroughly it has failed, including by showing its own model cheating on it.
