Jensen Huang declared AGI while the best models on Earth can't crack what a human child solves in seconds.
The Summary
- The ARC-AGI-3 benchmark launched the same week Nvidia's CEO declared artificial general intelligence achieved. Google's Gemini scored 0.37%. OpenAI's GPT-5.4 managed 0.26%. Humans score 100%.
- The test measures abstract reasoning, the kind your brain does automatically when you see a pattern once and extrapolate it to new situations.
- The gap between marketing narratives and actual capability has never been wider.
The Signal
ARC-AGI-3 is a benchmark designed to test what researchers call "fluid intelligence," the ability to reason abstractly without prior training on similar problems. Think visual pattern completion tasks a bright ten-year-old handles instinctively. The premise is simple: if you've achieved general intelligence, you should be able to solve novel problems by understanding underlying principles, not by pattern-matching against training data.
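To make that concrete, here's a toy sketch of what an ARC-style task looks like. The grid format, the single demonstration pair, and the solver are all illustrative simplifications, not the real ARC-AGI-3 harness: the point is that the solver must infer a rule from one example and apply it to a grid it has never seen.

```python
# Toy ARC-style task: infer a transformation from one example pair,
# then apply it to a novel input. Grids are lists of lists of color ints.
# This format and solver are illustrative, not the actual ARC benchmark.

def infer_color_map(inp, out):
    """Learn a per-cell color substitution from a single example pair."""
    mapping = {}
    for row_in, row_out in zip(inp, out):
        for a, b in zip(row_in, row_out):
            if a in mapping and mapping[a] != b:
                return None  # rule isn't a simple color substitution
            mapping[a] = b
    return mapping

def apply_color_map(mapping, grid):
    """Apply the learned substitution; unknown colors pass through."""
    return [[mapping.get(c, c) for c in row] for row in grid]

# One demonstration pair: every 1 becomes 2, background 0 is unchanged.
train_in  = [[0, 1, 0],
             [1, 1, 1],
             [0, 1, 0]]
train_out = [[0, 2, 0],
             [2, 2, 2],
             [0, 2, 0]]

rule = infer_color_map(train_in, train_out)

# A novel test grid: the rule learned from one example should generalize.
test_in = [[1, 0],
           [0, 1]]
print(apply_color_map(rule, test_in))  # → [[2, 0], [0, 2]]
```

A human glances at the example pair and extracts "1 becomes 2" instantly. Real ARC tasks compose rules like this (symmetry, counting, object movement) in ways that resist memorization, which is exactly why models trained on pattern-matching score near zero.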
The timing here is pointed. Jensen Huang stood on stage and declared we've reached AGI. Meanwhile, the models everyone's betting billions on can't break 1% on tasks that require actual reasoning rather than sophisticated autocomplete. This isn't a minor gap. This is the difference between a calculator and a mathematician.
What's actually happening is that frontier models have gotten remarkably good at tasks with massive training datasets. They excel at anything that looks like something they've seen before. But show them a genuinely novel problem, something that requires forming a mental model and applying it to a new context, and they collapse. The 0.26% to 0.37% range isn't "early AGI." It's statistical noise around zero.
The ARC benchmark family has been around since 2019, designed specifically to resist the kind of brute-force scaling that's driven recent AI progress. ARC-AGI-3 ups the difficulty, and the results are clarifying. We're not approaching AGI. We're building incredibly powerful tools that are fundamentally different from general intelligence, and pretending otherwise serves nobody except people trying to justify valuations.
The Implication
If you're building in the agent economy, this matters more than any product launch. The agents you're deploying today are powerful narrow tools, not reasoning entities. Design for that reality. Give them constrained domains, clear guardrails, and human oversight on novel situations. The companies that will win in Web4 aren't the ones betting everything on imminent AGI. They're the ones building robust systems that combine narrow AI capabilities with human judgment in the right proportions. Watch what the serious researchers say, not what the CEOs selling chips claim.
Source: Decrypt