In brief
- ARC-AGI-3 exposes a huge gap between AGI claims and reality, with leading AI models scoring below 1% while humans achieve perfect performance.
- The benchmark tests real generalization: it requires agents to explore, plan, and learn from scratch in unfamiliar environments rather than recall trained patterns.
- Despite industry hype, current AI systems remain far from AGI, lacking the reasoning and adaptability that even young children display naturally.
Nvidia CEO Jensen Huang went on Lex Fridman's podcast recently and said, plainly, "I believe we have achieved AGI." Two days later, the most rigorous test in AI research dropped its latest artificial general intelligence benchmark, and every frontier model scored below 1%.
The ARC Prize Foundation released ARC-AGI-3 today, and the results are brutal. Google's Gemini 3.1 Pro led the pack at 0.37%. OpenAI's GPT-5.4 came in at 0.26%. Anthropic's Claude Opus 4.6 managed 0.25%, while xAI's Grok-4.20 scored exactly zero. Humans, meanwhile, solved 100% of the environments.
This isn't a trivia test or a coding exam, or even a set of ultra-hard PhD-level questions. ARC-AGI-3 is something entirely different from anything the AI industry has faced before.
The benchmark was built by François Chollet and Mike Knoop's foundation, which set up an internal game studio and created 135 original interactive environments from scratch. The idea is to drop an AI agent into an unfamiliar game-like world with no instructions, no stated goals, and no explanation of the rules. The agent has to explore, figure out what it's supposed to do, form a plan, and execute it.
If that sounds like something any five-year-old can do, you're beginning to understand the problem. If you want to see whether you are better than the AI, you can play the same games included in the test by clicking this link. We tried one; it was confusing at first, but after a few seconds you can easily get the hang of it.
It's also the clearest example of what the "G" in AGI means. To generalize is to produce new knowledge (how an unfamiliar game works) without being trained on it beforehand.
Previous versions of ARC tested static visual puzzles: show a pattern, predict the next one. They were hard at first. Then the labs threw compute and training at them until the benchmarks were effectively dead. ARC-AGI-1, introduced in 2019, fell to test-time training and reasoning models. ARC-AGI-2 lasted about a year before Gemini 3.1 Pro hit 77.1%. The labs are good at saturating benchmarks they can train against.
Version 3 was designed specifically to prevent that. With 110 of the 135 environments kept private (55 semi-private for API testing, 55 fully locked for competitions), there's no dataset to memorize. You can't brute-force your way through novel game logic you've never seen.
Scoring isn't pass/fail either. ARC-AGI-3 uses what the foundation calls RHAE, Relative Human Action Efficiency. The baseline is the second-best first-run human performance. An AI that takes 10 times as many actions as a human scores 1% for that level, not 10%: the formula squares the penalty for inefficiency. Wandering around, backtracking, and guessing your way to an answer gets punished hard.
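To make the arithmetic concrete, here is a minimal sketch of per-level scoring, assuming RHAE is simply the human-to-agent action ratio squared; the function name and the clamping at parity are our own illustration, not the foundation's published formula.

```python
def rhae_level_score(human_actions: int, agent_actions: int) -> float:
    """Hypothetical per-level RHAE: the squared ratio of the human
    baseline's action count to the agent's action count."""
    if agent_actions <= 0:
        return 0.0
    efficiency = min(1.0, human_actions / agent_actions)  # cap at human parity
    return efficiency ** 2  # squaring makes inefficiency hurt quadratically

# An agent needing 10x the human baseline's actions scores 1%, not 10%:
print(rhae_level_score(human_actions=50, agent_actions=500))  # -> 0.01
```

Under a rule like this, halving your efficiency doesn't halve your score; it quarters it, which is why meandering play collapses toward zero.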
The best AI agent in the month-long developer preview scored 12.58%. Frontier LLMs evaluated through the official API, with no custom tooling, couldn't crack 1%. Ordinary humans solved all 135 environments with no prior training and no instructions. If that's the bar, the current crop of models isn't clearing it.
There is one genuine methodological dispute here. ARC's report says a Duke-built custom harness pushed Claude Opus 4.6 from 0.25% to 97.1% on a single environment version called TR87. That doesn't mean Claude scored 97.1% on ARC-AGI-3 overall; its official benchmark score remained 0.25%. But the jump is still worth noting.
The official benchmark feeds agents JSON, not visuals. That's either a methodological flaw or a demonstration that today's models handle human-friendly information better than raw structured data. Chollet's foundation has acknowledged the dispute but isn't changing the format.
"Frame content understanding and API format are not limiting factors for frontier model performance on ARC-AGI-3," the paper reads. In other words, the foundation appears to reject the idea that models fail because they "can't see" the tasks properly, arguing instead that perception is already sufficient and the real gap lies in reasoning and generalization.
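For a sense of what that text-based interaction looks like from the agent's side, here is a rough sketch of a frame-by-frame loop; the frame fields, action names, and callbacks below are hypothetical stand-ins, not the actual ARC-AGI-3 API.

```python
import json
import random

def random_agent(get_frame, send_action, max_steps=1000):
    """Blindly explore an unknown environment: read each JSON frame,
    then sample one of whatever actions the frame advertises."""
    for _ in range(max_steps):
        frame = json.loads(get_frame())        # structured state, not pixels
        if frame.get("state") == "WIN":        # hypothetical terminal flag
            return True
        actions = frame.get("available_actions", [])
        if not actions:
            return False                       # nothing left to try
        send_action(random.choice(actions))    # no rules given, so guess
    return False
```

Even if a random walker like this eventually stumbled onto a win, RHAE's squared penalty would score all that wasted exploration at close to zero.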
The AGI reality check arrived during a week when the hype machine was running at full speed. Besides Huang's comment, Arm named its new data center chip the "AGI CPU." OpenAI's Sam Altman has said they have "essentially built AGI," and Microsoft is already marketing a lab focused on building ASI, the step that supposedly comes after AGI is achieved. The term is being stretched until it means whatever is commercially convenient, it seems.
Chollet's position is simpler. If an ordinary human with no instructions can do it and your system can't, you don't have AGI; you have a very expensive autocomplete that needs a lot of help.
ARC Prize 2026 is offering $2 million across three competition tracks, all hosted on Kaggle. Every winning solution must be open-sourced. The clock is running, and right now, the machines aren't even close.
