In brief
- ARC-AGI-3 exposes a huge gap between AGI claims and reality, with leading AI models scoring below 1% while humans achieve perfect performance.
- The benchmark tests real generalization: it requires agents to explore, plan, and learn from scratch in unfamiliar environments rather than recall trained patterns.
- Despite industry hype, current AI systems remain far from AGI, lacking the reasoning and adaptability that even young children display naturally.
Nvidia CEO Jensen Huang went on Lex Fridman's podcast recently and said, plainly, "I believe we have achieved AGI." Two days later, the most rigorous test in AI research dropped its latest artificial general intelligence benchmark, and every frontier model scored below 1%.
The ARC Prize Foundation released ARC-AGI-3 today, and the results are brutal. Google's Gemini 3.1 Pro led the pack at 0.37%. OpenAI's GPT-5.4 came in at 0.26%. Anthropic's Claude Opus 4.6 managed 0.25%, while xAI's Grok-4.20 scored exactly zero. Humans, meanwhile, solved 100% of the environments.
This isn't a trivia test or a coding exam, or even a set of ultra-hard PhD-level questions. ARC-AGI-3 is something entirely different from anything the AI industry has faced before.
The benchmark was built by François Chollet and Mike Knoop's foundation, which set up an internal game studio and created 135 original interactive environments from scratch. The idea is to drop an AI agent into an unfamiliar game-like world with no instructions, no stated goals, and no explanation of the rules. The agent has to explore, figure out what it's supposed to do, form a plan, and execute it.
If that sounds like something any five-year-old can do, you're beginning to understand the problem. If you want to see whether you are better than the AI, you can play the same games included in the test by clicking this link. We tried one; it was confusing at first, but after a few seconds you can easily get the hang of it.
It's also the clearest example of what the "G" in AGI means. To generalize is to produce new knowledge (how an unfamiliar game works) without being trained on it beforehand.
Previous versions of ARC tested static visual puzzles: show a pattern, predict the next one. They were hard at first. Then the labs threw compute and training at them until the benchmarks were effectively dead. ARC-AGI-1, introduced in 2019, fell to test-time training and reasoning models. ARC-AGI-2 lasted about a year before Gemini 3.1 Pro hit 77.1%. The labs are good at saturating benchmarks they can train against.
Version 3 was designed specifically to prevent that. With 110 of the 135 environments kept private (55 semi-private for API testing, 55 fully locked for competitions), there's no dataset to memorize. You can't brute-force your way through novel game logic you've never seen.
Scoring isn't pass/fail either. ARC-AGI-3 uses what the foundation calls RHAE, Relative Human Action Efficiency. The baseline is the second-best first-run human performance. An AI that takes 10 times as many actions as a human scores 1% for that level, not 10%: the formula squares the penalty for inefficiency. Wandering around, backtracking, and guessing your way to an answer gets punished hard.
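To make the arithmetic concrete, here is a minimal sketch of per-level scoring, assuming RHAE is simply the human-to-agent action ratio squared; the function name and the clamping at parity are our own illustration, not the foundation's published formula.

```python
def rhae_level_score(human_actions: int, agent_actions: int) -> float:
    """Hypothetical per-level RHAE: the squared ratio of the human
    baseline's action count to the agent's action count."""
    if agent_actions <= 0:
        return 0.0
    efficiency = min(1.0, human_actions / agent_actions)  # cap at human parity
    return efficiency ** 2  # squaring makes inefficiency hurt quadratically

# An agent needing 10x the human baseline's actions scores 1%, not 10%:
print(rhae_level_score(human_actions=50, agent_actions=500))  # -> 0.01
```

Under a rule like this, halving your efficiency doesn't halve your score; it quarters it, which is why meandering play collapses toward zero.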
The best AI agent in the month-long developer preview scored 12.58%. Frontier LLMs evaluated through the official API, with no custom tooling, couldn't crack 1%. Ordinary humans solved all 135 environments with no prior training and no instructions. If that's the bar, the current crop of models isn't clearing it.
There is one genuine methodological dispute here. ARC's report says a Duke-built custom harness pushed Claude Opus 4.6 from 0.25% to 97.1% on a single environment version called TR87. That doesn't mean Claude scored 97.1% on ARC-AGI-3 overall; its official benchmark score remained 0.25%. But the jump is still worth noting.
The official benchmark feeds agents JSON, not visuals. That's either a methodological flaw or a demonstration that today's models handle human-friendly information better than raw structured data. Chollet's foundation has acknowledged the dispute but isn't changing the format.
"Frame content understanding and API format are not limiting factors for frontier model performance on ARC-AGI-3," the paper reads. In other words, the foundation appears to reject the idea that models fail because they "can't see" the tasks properly, arguing instead that perception is already sufficient and the real gap lies in reasoning and generalization.
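For a sense of what that text-based interaction looks like from the agent's side, here is a rough sketch of a frame-by-frame loop; the frame fields, action names, and callbacks below are hypothetical stand-ins, not the actual ARC-AGI-3 API.

```python
import json
import random

def random_agent(get_frame, send_action, max_steps=1000):
    """Blindly explore an unknown environment: read each JSON frame,
    then sample one of whatever actions the frame advertises."""
    for _ in range(max_steps):
        frame = json.loads(get_frame())        # structured state, not pixels
        if frame.get("state") == "WIN":        # hypothetical terminal flag
            return True
        actions = frame.get("available_actions", [])
        if not actions:
            return False                       # nothing left to try
        send_action(random.choice(actions))    # no rules given, so guess
    return False
```

Even if a random walker like this eventually stumbled onto a win, RHAE's squared penalty would score all that wasted exploration at close to zero.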
The AGI reality check arrived during a week when the hype machine was running at full speed. Besides Huang's comment, Arm named its new data center chip the "AGI CPU." OpenAI's Sam Altman has said they have "essentially built AGI," and Microsoft is already marketing a lab focused on building ASI, the step that supposedly comes after AGI is achieved. The term is being stretched until it means whatever is commercially convenient, it seems.
Chollet's position is simpler. If an ordinary human with no instructions can do it and your system can't, you don't have AGI; you have a very expensive autocomplete that needs a lot of help.
ARC Prize 2026 is offering $2 million across three competition tracks, all hosted on Kaggle. Every winning solution must be open-sourced. The clock is running, and right now, the machines aren't even close.
