The smartest AI systems on Earth can write code, beat grandmasters at chess, and pass the bar exam, but they can't figure out how to play Pokemon.
The Summary
- LLMs remain terrible at video games even as they excel at coding, chess, and complex reasoning; the rare wins, like Gemini 2.5 Pro beating Pokemon Blue, required custom software and took far longer than a human player would need.
- Julian Togelius from NYU's Game Innovation Lab argues the gap reveals fundamental limits in how current AI systems learn and adapt to open-ended environments.
- Coding works for LLMs because it's a "well-behaved game" with immediate, granular feedback. Video games are messy and ambiguous, and they demand spatial reasoning that current AI lacks.
The Signal
Here's what matters. LLMs have gotten so good at specific tasks that we've started mistaking narrow competence for general intelligence. They crush coding benchmarks because code is perfectly designed for how they learn. You write something, it either compiles or it doesn't, tests pass or fail, and you get explicit feedback on what broke. It's a tight loop of clear rules and instant rewards. Video games, even simple ones, break that pattern completely.
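To make that tight loop concrete, here's a minimal sketch of the feedback coding gives you. Everything in it is hypothetical (a made-up `slugify` function with a deliberate bug and a toy test), but the shape is the point: the failure message names the exact input and the exact mismatch.

```python
def slugify(title: str) -> str:
    """Convert a title to a URL slug (buggy on purpose: keeps double hyphens)."""
    return title.lower().replace(" ", "-")

def test_slugify():
    cases = {
        "Hello World": "hello-world",
        "Hello  World": "hello-world",   # double space: the buggy case
    }
    for title, expected in cases.items():
        got = slugify(title)
        assert got == expected, f"slugify({title!r}) -> {got!r}, expected {expected!r}"

if __name__ == "__main__":
    try:
        test_slugify()
        print("all tests passed")
    except AssertionError as exc:
        # Feedback is immediate and granular: the exact failing comparison,
        # not a vague "you lost" signal hours later.
        print(f"test failed: {exc}")
```

Run it and the error tells you precisely what broke and why. No video game hands an agent anything that clean.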
Togelius frames coding as a kind of game, which is true. But it's a game with training wheels. Video games, the actual kind, require you to build spatial models, remember where you've been, plan multi-step sequences without clear checkpoints, and interpret visual information that changes frame by frame. When Gemini 2.5 Pro finally beat Pokemon Blue, it needed custom software just to interface with the game and made "bizarre and often repetitive mistakes" that no human player would make. This isn't AGI with a controller. This is a very sophisticated text predictor being forced into a domain it wasn't built for.
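For contrast, here is a rough sketch of the kind of scaffolding an LLM needs just to touch a game like Pokemon Blue. None of these names come from the actual Gemini setup; `describe_frame`, `query_model`, and the `emulator.screen()` / `emulator.press()` interface are all hypothetical stand-ins. The point is how much the harness, not the model, has to carry: capturing frames, flattening them into text, maintaining memory, and translating text back into button presses.

```python
# Hypothetical harness for driving a Game Boy game with an LLM.
# This is a sketch of the general agent loop such custom software implements,
# not the interface used in the Gemini 2.5 Pro run.

from dataclasses import dataclass, field

BUTTONS = {"up", "down", "left", "right", "a", "b", "start", "select"}

@dataclass
class AgentState:
    history: list[str] = field(default_factory=list)  # the model's only "memory"

def describe_frame(frame) -> str:
    """Placeholder: convert raw pixels into text the model can read.
    In practice this step (OCR, tile maps, hand-written summaries) is
    where most of the spatial information gets lost."""
    return "You are in a grassy area. Exits: north, east."

def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; returns the next button to press."""
    return "up"

def step(emulator, state: AgentState) -> None:
    observation = describe_frame(emulator.screen())   # hypothetical emulator API
    prompt = "\n".join(state.history[-20:] + [observation, "Next button?"])
    action = query_model(prompt).strip().lower()
    if action not in BUTTONS:                         # models emit invalid moves
        action = "a"                                  # fall back to something legal
    emulator.press(action)                            # hypothetical emulator API
    state.history.append(f"{observation} -> pressed {action}")
```

Note the truncated history window: once a relevant observation scrolls out of the prompt, the agent can repeat the same mistake indefinitely, which is one plausible reading of those "bizarre and often repetitive" errors.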
The broader implication cuts deeper than gaming. We don't have general game AI. We have extremely powerful pattern matchers trained on text and code. The moment you step outside domains with clean feedback loops and well-defined success criteria, the magic disappears. An AI agent that can't navigate a 2D Pokemon world is not going to autonomously manage your supply chain, negotiate contracts on your behalf, or make real-time decisions in messy environments. It might help, but it won't replace the human doing the actual steering.
The Implication
If you're building or buying AI agents for real work, ask yourself: does this task look more like coding or more like Pokemon? If it requires spatial reasoning, navigating ambiguous goals, or learning from sparse feedback in a dynamic environment, current LLMs are going to struggle. The hype cycle wants you to believe we're six months from autonomous everything. The Pokemon test says otherwise. Watch what AI can't do. That's where the actual work still lives.
Source: IEEE Spectrum AI