The smartest AI on the planet can write a React app but can't figure out how to jump over a turtle in Mario.
The Summary
- Large language models still can't play video games competently, even as they ace increasingly complex benchmarks in coding, math, and reasoning.
- Julian Togelius, director of NYU's Game Innovation Lab, argues this failure reveals fundamental limits in how today's AI actually understands sequential decision-making.
- Coding works as an "extremely well-behaved game" with immediate, granular feedback. Video games require spatial reasoning, long-term planning, and learning from sparse rewards.
The Signal
Gemini 2.5 Pro technically beat Pokémon Blue in May 2025. It took weeks. It made the same mistakes over and over. It needed custom software just to press the right buttons. Call it a win if you want, but it looked less like intelligence and more like a very expensive random number generator that occasionally got lucky.
Julian Togelius's research frames the problem clearly: LLMs excel at coding because coding is a game with training wheels. You write a function, and it either compiles or it doesn't. The tests pass or they fail. The error messages usually point to what broke. Nearly every change gives you feedback. You're learning in real time, with guardrails.
"There's a theory from game designer Raph Koster that games are fun because we learn to play them as we play them. From that perspective, writing code is an extremely well-designed game."
Video games strip all that away. You don't get a stack trace when you die to the same boss for the tenth time. The reward structure is sparse: you might play for an hour before you know whether your strategy actually works. And the spatial reasoning required (jumping between platforms, dodging bullets, navigating 3D environments) is something LLMs are not built to do. They predict tokens of text. They don't maintain an explicit model of physical space.
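To make the contrast between dense and sparse feedback concrete, here is a minimal toy sketch in Python. It is not how an LLM or a game-playing agent actually learns; the button alphabet, the `SOLUTION` sequence, and every function name are made up for illustration. The same trial-and-error learner gets either per-step feedback (like a test suite reporting how much already works) or all-or-nothing feedback (like a level you either clear or don't).

```python
import random

BUTTONS = "LRAB"                 # hypothetical controller alphabet
SOLUTION = list("RRARRBRRARRA")  # made-up input sequence that clears the toy level


def dense_feedback(plan):
    # Like a test suite: reports how many steps are already correct.
    return sum(p == s for p, s in zip(plan, SOLUTION))


def sparse_feedback(plan):
    # Like a level: 1 only if the whole run succeeds, 0 otherwise.
    return int(plan == SOLUTION)


def attempts_to_solve(feedback, budget=5000, seed=0):
    """Try random single-step edits, keeping one only if feedback strictly improves.

    Returns the number of attempts used, or None if the budget runs out.
    """
    rng = random.Random(seed)
    plan = ["L"] * len(SOLUTION)  # start from a plan that is wrong at every step
    best = feedback(plan)
    for attempt in range(1, budget + 1):
        candidate = plan.copy()
        candidate[rng.randrange(len(plan))] = rng.choice(BUTTONS)
        score = feedback(candidate)
        if score > best:
            plan, best = candidate, score
        if plan == SOLUTION:
            return attempt
    return None


dense_attempts = attempts_to_solve(dense_feedback)
sparse_attempts = attempts_to_solve(sparse_feedback)
print("dense feedback solved after:", dense_attempts)
print("sparse feedback solved after:", sparse_attempts)
```

With dense feedback the learner climbs one correct button at a time and finishes well inside the budget. With sparse feedback no single edit ever changes the score, so it never moves off its starting plan and the function returns `None`. Real games are far richer than this, but the asymmetry is the point.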
This isn't just a quirky limitation of LLMs. Togelius points out that we don't have general game AI of any kind. DeepMind's agents crushed Go and StarCraft, but those were purpose-built reinforcement learning systems trained on millions of games. They didn't generalize. They couldn't take what they learned in StarCraft and apply it to Minecraft.
The gap matters because video games are a better proxy for real-world agent work than coding is. In the real world:
- Feedback is delayed and ambiguous
- You need to model physical or pseudo-physical environments
- Success requires long chains of decisions where early mistakes compound
- The rules aren't handed to you in a spec sheet
An AI that can't navigate a 2D platformer isn't going to autonomously run your supply chain or manage a fleet of delivery robots. The skills don't transfer. And right now, the best LLMs on the planet are still stuck on level one.
The Implication
If you're building agents, test them on tasks that look more like video games and less like coding puzzles. Can they handle delayed feedback? Can they learn from failure without explicit error messages? Can they plan multiple steps ahead in an environment they don't fully control?
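One way to probe the delayed-feedback question is to wrap an existing evaluation task so that rewards arrive late, then compare how an agent does with and without the delay. The sketch below is a minimal Python illustration under stated assumptions: `CounterTask`, `DelayedFeedback`, `run`, and `alternating_policy` are hypothetical names, and the task itself is a trivial stand-in for whatever your agent actually does.

```python
from collections import deque


class CounterTask:
    """Hypothetical toy task: reward 1.0 when the action matches the step's parity."""

    def __init__(self, length=6):
        self.length = length
        self.t = 0

    def step(self, action):
        reward = 1.0 if action == self.t % 2 else 0.0
        self.t += 1
        return reward, self.t >= self.length


class DelayedFeedback:
    """Wraps a task so each reward is shown `delay` steps late.

    Rewards still pending when the episode ends are paid out on the final step,
    so the total is unchanged; only the timing differs.
    """

    def __init__(self, task, delay):
        self.task = task
        self.pending = deque([0.0] * delay)

    def step(self, action):
        reward, done = self.task.step(action)
        self.pending.append(reward)
        visible = self.pending.popleft()
        if done:
            visible += sum(self.pending)
            self.pending.clear()
        return visible, done


def run(env, policy):
    """Play one episode and return the rewards in the order the agent saw them."""
    rewards, done, t = [], False, 0
    while not done:
        reward, done = env.step(policy(t))
        rewards.append(reward)
        t += 1
    return rewards


def alternating_policy(t):
    # Stand-in for a real agent: it already knows the right action.
    return t % 2


immediate = run(DelayedFeedback(CounterTask(), delay=0), alternating_policy)
delayed = run(DelayedFeedback(CounterTask(), delay=3), alternating_policy)
print(immediate)
print(delayed)
```

The fixed policy here scores the same total either way, because it never needs the feedback. The interesting measurement is what happens when you swap in an agent that has to learn from those rewards: if its score collapses as `delay` grows, it was leaning on immediate feedback.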
The companies that crack general game AI, meaning systems that can pick up any game and learn to play it competently without custom training, will have built something far more valuable than a better chatbot. They'll have built the foundation for agents that actually work in the messy, low-feedback, spatially complex real world.