François Chollet just dropped a benchmark that measures what AI actually can't do yet, and the timing couldn't be more pointed.
The Summary
- The ARC Prize Foundation released ARC-AGI-3, a new benchmark with 1,000+ video-game-like scenarios testing whether AI agents can reason through novel situations rather than pattern-match from training data
- Humans breeze through these tasks. Most AI systems don't. That gap measures something fundamental about intelligence versus memorization.
- A $1 million prize pool is on offer for agents that solve tasks efficiently, with scoring based on step economy, not just completion
The Signal
Chollet has been the irritating voice in AI research for years, the one pointing out that we're measuring the wrong thing. While everyone celebrated GPT's ability to write sonnets and pass bar exams, he kept asking: can it actually think through something it's never seen before? Now that agent companies are trying to ship autonomous systems that operate in unpredictable environments (your inbox, your codebase, your supply chain), his critique suddenly matters commercially.
The ARC-AGI-3 benchmark isn't testing whether an AI can translate French or summarize a PDF. It's testing whether an agent can walk into a simple game world with zero instructions, figure out the rules through observation, develop a theory about how things work, then execute a multi-step strategy to reach a goal. This is what your autonomous sales agent will need when it encounters a prospect who doesn't match any pattern in your CRM. This is what your coding agent needs when it hits a bug configuration no one documented on GitHub.
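That observe-then-theorize-then-act loop can be sketched in miniature. The toy environment and agent below are purely illustrative (the names `ToyGridWorld` and `explore_then_exploit` are invented here, and this is not the actual ARC-AGI-3 interface): the agent gets no instructions, probes each action once to learn its effect, then plans toward the goal using the rules it inferred.

```python
class ToyGridWorld:
    """A hypothetical stand-in for an ARC-AGI-3-style environment:
    the agent sees observations and an action set, but no rules."""
    ACTIONS = ["left", "right"]

    def __init__(self, goal=3):
        self.pos = 0
        self.goal = goal

    def step(self, action):
        # Hidden rule the agent must discover: "right" is +1, "left" is -1.
        self.pos += 1 if action == "right" else -1
        return self.pos, self.pos == self.goal


def explore_then_exploit(env, budget=20):
    """Observe -> hypothesize -> act: try each action once to learn its
    effect, then repeatedly pick the action that moves toward the goal."""
    effects = {}
    obs = env.pos
    for a in ToyGridWorld.ACTIONS:      # observation phase: probe each action
        new_obs, done = env.step(a)
        effects[a] = new_obs - obs      # hypothesized rule for this action
        obs = new_obs
        if done:
            return True, 0
    steps = 0
    while steps < budget:               # execution phase: act on the theory
        # pick the action whose learned effect points toward the goal
        a = max(effects, key=lambda k: effects[k] * (env.goal - obs))
        obs, done = env.step(a)
        steps += 1
        if done:
            return True, steps
    return False, steps
```

The point of the sketch is the separation of phases: the agent spends a few steps buying information, builds an explicit model (`effects`), and only then commits to a strategy, rather than pattern-matching against anything seen in training.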
The benchmark rewards efficiency explicitly. You don't win by brute-forcing solutions through massive compute. You win by building abstractions, by actually understanding the underlying structure of a problem. That's closer to how humans work and further from how current transformer models work, which is why these tasks remain hard for AI systems that sail through conventional benchmarks.
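One way to picture step-economy scoring, as a rough illustration only (this formula is invented here, not ARC's published scoring rule): an agent earns full credit for solving a task in the minimum number of steps, and its credit decays as it wastes actions.

```python
def step_economy_score(solved, steps_taken, optimal_steps):
    """Hypothetical efficiency-weighted score: 1.0 for an optimal
    solution, proportionally less for wasted steps, 0 for failure.
    Illustrative only; not the actual ARC-AGI-3 scoring formula."""
    if not solved or steps_taken <= 0:
        return 0.0
    return min(1.0, optimal_steps / steps_taken)
```

Under a rule like this, a brute-force agent that flails through hundreds of actions before stumbling onto the goal scores far below one that solves the same task in a handful of deliberate moves, even though both "completed" it.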
The timing of this release matters. We're in the middle of an agent deployment wave. Companies are racing to ship AI that operates semi-autonomously. The gap between "works great in the demo" and "handles Tuesday's weird edge case" is where millions of dollars get burned. A benchmark that actually measures adaptability to novelty is suddenly not academic posturing but commercial survival.
The Implication
If you're building or buying agent infrastructure, watch which systems start scoring well on ARC-AGI-3. That's your signal for which architectures can actually handle operational reality versus which ones just memorized Stack Overflow really well. The models that can generalize from sparse examples are the ones that won't call you at 3am when they encounter something outside their training distribution. This benchmark might finally separate the agents that work from the agents that just work in the pitch deck.
Source: Fast Company Tech