Trader News
AI

OpenAI Says Benchmark Used to Measure AI Coding Skill Is ‘Contaminated’—Here’s Why

By News Room, Feb 24, 2026 5:28 pm EST

In brief

  • OpenAI argues that SWE-bench Verified no longer reflects genuine coding ability because the benchmark is likely contaminated.
  • It is now pushing SWE-bench Pro as a harder replacement.
  • Scores plunged from ~70% to ~23% on the newer benchmark.

The number that every major AI lab has been using to claim coding supremacy was just declared worthless.

OpenAI published a post today saying that SWE-bench Verified, the go-to benchmark for measuring AI coding ability, is so riddled with flawed tests and training data leakage that it no longer tells you anything useful about whether a model can actually write software.

The benchmark works like this: Give an AI a real GitHub issue from a popular open-source Python project, ask it to fix the bug without seeing the tests, and check whether its patch makes the failing tests pass without breaking anything else.
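As a rough illustration of that pass criterion, here is a minimal Python sketch. It assumes the tests have already been run and only their names are known; the class and function names (`TaskResult`, `is_resolved`, `resolve_rate`) are hypothetical and are not taken from the SWE-bench harness.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure for illustration)."""
    # Tests that failed before the fix and must pass once the bug is repaired.
    fail_to_pass: FrozenSet[str]
    # Tests that passed before the fix and must keep passing afterwards.
    pass_to_pass: FrozenSet[str]
    # Tests that passed after the model's patch was applied.
    passed_after_patch: FrozenSet[str]


def is_resolved(result: TaskResult) -> bool:
    """A task counts as solved only if the bug is fixed and nothing else broke."""
    fixed = result.fail_to_pass <= result.passed_after_patch
    unbroken = result.pass_to_pass <= result.passed_after_patch
    return fixed and unbroken


def resolve_rate(results: Iterable[TaskResult]) -> float:
    """Fraction of tasks solved, the kind of headline score labs quote."""
    results = list(results)
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Under this sketch, a patch that fixes the reported bug but breaks a previously passing test does not count, which matches the "without breaking anything else" condition described above.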

OpenAI created SWE-bench Verified in August 2024 as a cleaner version of the original 2023 benchmark, hiring 93 software engineers to filter out tasks that were impossible or poorly designed.

The cleanup worked well enough that every major lab began citing scores on it as proof of progress. When Anthropic released Claude Opus 4 in May 2025, Decrypt reported that the model scored 72.5% on SWE-bench Verified, beating GPT-4.1’s 54.6% and Gemini 2.5 Pro’s 63.2%. It was the coding benchmark that mattered.

Since then, every AI lab from America to China has touted its SWE-bench performance to claim the throne as the best model for coding.

Now OpenAI says that race was partly a mirage. According to the report, the team audited 138 tasks that GPT-5.2 consistently failed across 64 independent runs, and had six engineers review each one. It ultimately concluded that 59.4% of those tasks are broken.

About 35.5% have tests so narrowly written that they require a specific function name never mentioned in the problem description. Another 18.8% check for features that weren’t part of the original problem at all, pulled in from unrelated pull requests.
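To put those percentages in task counts, the short Python sketch below converts each reported share into a whole number of tasks. It assumes every percentage is a share of the same 138 audited tasks, which the report summary above does not state explicitly; the labels and function names are illustrative only.

```python
AUDITED_TASKS = 138  # tasks GPT-5.2 consistently failed, per the report

# Shares quoted above, in percent. Assumption: all use 138 as the denominator.
REPORTED_SHARES = {
    "broken_total": 59.4,
    "overly_narrow_tests": 35.5,
    "tests_for_unrequested_features": 18.8,
}


def implied_count(share_percent: float, total: int = AUDITED_TASKS) -> int:
    """Nearest whole number of tasks matching a reported percentage."""
    return round(total * share_percent / 100)


def round_trips(share_percent: float, total: int = AUDITED_TASKS) -> bool:
    """Check that the implied count reproduces the percentage to one decimal."""
    count = implied_count(share_percent, total)
    return round(100 * count / total, 1) == share_percent


def implied_counts() -> dict:
    """Map each reported category to its implied task count."""
    return {label: implied_count(share) for label, share in REPORTED_SHARES.items()}


print(implied_counts())
# {'broken_total': 82, 'overly_narrow_tests': 49, 'tests_for_unrequested_features': 26}
```

Each implied count reproduces its published percentage to one decimal place, which is consistent with the shared-denominator assumption. The two named categories do not add up to the broken total, so the remaining broken tasks presumably fall under flaws not itemized here.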

The contamination problem works roughly like this: SWE-bench draws its problems from open-source repositories that most AI companies crawl when building training sets. OpenAI tested whether GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview had seen the benchmark’s solutions during training. All three had.

Given only a task ID and a brief hint, each model could reproduce the exact code fix from memory, including variable names and inline comments that appear nowhere in the problem description. In one case, GPT-5.2’s chain-of-thought logs showed it reasoning that a particular parameter must have been “included around Django 4.1,” a detail found only in Django’s release notes, not the task description. It was answering a question it had already seen the answer to.
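A memorization check of this kind can be approximated in a few lines of Python. The sketch below is not OpenAI's method: it simply flags a model answer that closely matches the reference fix while reusing identifiers and comment words the problem description never mentions. The function names and both thresholds are assumptions chosen for illustration.

```python
import keyword
import re
from difflib import SequenceMatcher

# Matches identifier-like words, including words inside comments.
_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokens(text: str) -> set:
    """Identifier-like words in the text, excluding Python keywords."""
    return {w for w in _WORD.findall(text) if not keyword.iskeyword(w)}


def leaked_tokens(model_output: str, reference_patch: str, problem: str) -> set:
    """Words shared with the reference fix that the problem text never mentions."""
    return (tokens(model_output) & tokens(reference_patch)) - tokens(problem)


def patch_similarity(model_output: str, reference_patch: str) -> float:
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, model_output, reference_patch).ratio()


def looks_memorized(model_output: str, reference_patch: str, problem: str,
                    min_similarity: float = 0.9, min_leaked: int = 3) -> bool:
    """Heuristic flag; both thresholds are arbitrary and would need tuning."""
    similar = patch_similarity(model_output, reference_patch) >= min_similarity
    leaked = len(leaked_tokens(model_output, reference_patch, problem)) >= min_leaked
    return similar and leaked
```

A near-verbatim match alone could be coincidence for a one-line fix, which is why the sketch also requires overlap on names that could not have been inferred from the prompt.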

OpenAI now recommends SWE-bench Pro, a newer benchmark from Scale AI that uses more diverse codebases and licenses that reduce training data exposure. The performance drop is striking: models that cleared 70% on the old Verified benchmark score around 23% on SWE-bench Pro’s public split, and even less on its private tasks.

On the current public SWE-bench Verified leaderboard, OpenAI is far from the podium. Retiring a benchmark where you’re losing and backing one where everyone starts at 23% resets the scoreboard at a convenient moment and makes rivals’ claims less impressive.

This is especially important considering that the much-anticipated newer version of DeepSeek is rumored to beat or come extremely close to American AI models, especially in agentic and coding tasks, with a free, open-source model. That model may be days away from release, and SWE-bench Verified could be a key metric for judging its quality.

OpenAI said it is building privately authored evaluations that will not be released before testing, pointing to its GDPval project, where domain experts write original tasks graded by trained human reviewers.

The benchmark problem is not new, and it is not unique to coding. AI labs have cycled through numerous evaluations, each useful until models were trained on them or until the tasks proved too narrow.

But what makes this case notable is that OpenAI hyped SWE-bench Verified, promoted it across model releases, and is now publicly documenting how thoroughly it has failed, including by showing its own model cheating on it.

Copyright © 2026. TraderNews. All Rights Reserved.