    Science

    AI Models Score Below 1% on New AGI Benchmark Test

    March 26, 2026
    Quick Summary: The ARC-AGI-3 benchmark reveals top AI models scoring under 1% on novel interactive environments that ordinary humans solve with ease.

    Just days after Nvidia CEO Jensen Huang declared on the Lex Fridman podcast that artificial general intelligence has been achieved, a new benchmark designed to test that very claim delivered a stark counterpoint. The ARC Prize Foundation released ARC-AGI-3 this week, and every major frontier model scored below 1%. Humans, by contrast, solved all 135 environments with a perfect record and no prior training.

    The results place the leading AI systems in a difficult position. Google’s Gemini 3.1 Pro led all models at just 0.37%, followed by OpenAI’s GPT-5.4 at 0.26% and Anthropic’s Claude Opus 4.6 at 0.25%. xAI’s Grok-4.20 scored exactly zero. The gap between human performance and machine performance could hardly be wider.

    ARC-AGI-3 is not a trivia quiz, a coding exam, or a set of PhD-level questions. The benchmark was built by François Chollet and Mike Knoop’s foundation, which established an in-house game studio and created 135 original interactive environments from scratch. An AI agent is dropped into an unfamiliar game-like world with no instructions, no stated goals, and no description of the rules. It must explore the environment, determine what it is supposed to accomplish, form a plan, and carry it out.
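
    In rough terms, the task looks like the toy sketch below. Everything in it, from the grid world to its interface to the blind random policy, is invented for illustration rather than the benchmark’s actual API or environments; the point is only that the goal is never announced and must be discovered through interaction.

```python
# A toy illustration of the setting the article describes: the agent gets
# raw observations with no rules, no goals, and no instructions, and must
# discover the objective by acting. The environment and interface here are
# invented for illustration; they are not ARC-AGI-3's actual API.
import random

class HiddenGoalGrid:
    """A 5x5 grid world whose win condition is never revealed to the agent."""
    def __init__(self):
        self.pos = (0, 0)
        self._goal = (4, 4)  # hidden from the agent, as in the benchmark

    def observe(self):
        return self.pos  # raw state only; no rule book, no goal description

    def act(self, move):
        dx, dy = move
        x, y = self.pos
        self.pos = (min(max(x + dx, 0), 4), min(max(y + dy, 0), 4))

    def done(self):
        return self.pos == self._goal

env = HiddenGoalGrid()
actions = 0
while not env.done():
    env.act(random.choice([(0, 1), (0, -1), (1, 0), (-1, 0)]))  # blind exploration
    actions += 1
print(f"Found the hidden goal after {actions} actions")
```

    A blind random walk eventually stumbles onto the goal, but every wasted action counts against it under the scoring described next.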

    Earlier versions of the benchmark tested static visual puzzles and were eventually overcome by labs that threw compute power and targeted training at them. ARC-AGI-1, introduced in 2019, was defeated by test-time training and reasoning models. ARC-AGI-2 lasted roughly a year before Gemini 3.1 Pro reached 77.1%. Version 3 was designed to close that loophole. With 110 of the 135 environments kept private, there is no dataset for models to memorize, and novel game logic cannot be brute-forced.

    Scoring is also structured to penalize inefficiency. The foundation uses a system called RHAE, or Relative Human Action Efficiency, which benchmarks performance against the second-best first-run human result. An agent that requires ten times as many actions as a human scores 1% for that level, not 10%, because the formula squares the penalty for inefficiency. Wandering, backtracking, and guessing are all punished heavily under this approach.
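
    In code, a reading of that scoring rule consistent with the article’s example looks like the sketch below. The article gives only the baseline, the second-best first-run human result, and the squaring behavior, so the exact formula, including the cap at 100%, is an assumption.

```python
def rhae_score(agent_actions: int, human_actions: int) -> float:
    """Relative Human Action Efficiency for one level.

    A hypothetical reconstruction: the article specifies the second-best
    first-run human result as the baseline and says the penalty for
    inefficiency is squared; the cap at 1.0 is an added assumption.
    """
    efficiency = human_actions / agent_actions   # 1.0 when matching the human baseline
    return min(efficiency, 1.0) ** 2             # squaring punishes wandering and guessing

# The article's example: ten times as many actions as the human baseline
# yields (1/10)^2 = 0.01, i.e. 1% rather than 10%.
print(f"{rhae_score(agent_actions=500, human_actions=50):.2%}")  # 1.00%
```

    Squaring the efficiency term is what makes the gap so punishing: halving an agent’s efficiency quarters its score.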

    One methodological debate has emerged from the developer preview period. A custom harness built by a team at Duke pushed Claude Opus 4.6 from its official score of 0.25% to 97.1% on a single environment variant. That result does not alter the model’s overall benchmark score, but it has prompted discussion about whether the benchmark’s delivery of observations as JSON data rather than visual input disadvantages certain models. The foundation has acknowledged the debate but states that perception is not the limiting factor, arguing the real gap lies in reasoning and generalization.
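
    The contrast at issue is roughly the one below. The observation schema is an invented stand-in, since the article does not describe the benchmark’s actual format; it notes only that models receive JSON rather than a rendered view.

```python
import json

# Hypothetical observation for a single game state; the real schema is not
# given in the article, only that models get JSON rather than visuals.
observation = {"grid": [[0, 0, 2],
                        [0, 1, 0],
                        [2, 0, 0]]}

# What an API-tested model sees: a flat serialized string.
print(json.dumps(observation))

# What a human playtester sees is closer to a rendered image; an ASCII
# approximation of the same state:
for row in observation["grid"]:
    print(" ".join(".#o"[cell] for cell in row))
```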

    The benchmark’s release coincided with a period of intensifying AGI claims across the industry. Beyond Huang’s comments, Arm named a new data center chip the AGI CPU, OpenAI CEO Sam Altman has said the company has essentially built AGI, and Microsoft is already marketing a lab focused on building systems beyond AGI. Chollet’s position offers a simpler standard: if an ordinary person with no instructions can complete a task and a model cannot, the model does not qualify as general intelligence.

    The best AI agent during the month-long developer preview reached 12.58%, while frontier models tested through the official API without custom tooling could not surpass 1%. ARC Prize 2026 is offering $2 million across three competition tracks hosted on Kaggle, with a requirement that all winning solutions be open-sourced. The competition is underway, and based on current results, the leading models remain far from the threshold.

    Originally reported by Decrypt.

