OpenAI Can't Explain How Its Own AI Works

The companies building godlike AI still don't know how their current models make decisions — and they're betting your future on figuring it out later.

The Summary

Daniel Kokotajlo, a former OpenAI researcher who worked on forecasting and safety from 2022 to 2024, calls it an "open secret" that companies don't have a good plan for ensuring future AI systems follow human values
The core problem: researchers don't fully understand how advanced AI models make decisions internally, making it difficult to ensure AI systems reliably pursue the goals humans want
Kokotajlo now runs the AI Futures Project nonprofit, focusing on AI governance and reducing the risk of losing control as companies race toward AGI and superintelligence
AI agents could be the turning point where alignment failures become catastrophic

The Signal

Daniel Kokotajlo spent two years inside OpenAI studying how quickly AI systems could improve and what risks could emerge. His job was forecasting the economic, political, and safety implications of more powerful models. He left in 2024 with a clear message: the people building superintelligence don't actually know how to control it yet.

The alignment problem is simple to state and brutal to solve. You need AI systems that reliably do what humans want, even after those systems become smarter than humans in most domains. The catch is that researchers don't fully understand how current advanced models make decisions internally. You can't align what you don't understand.

"It's a sort of open secret, but we don't really have a good plan for how to do this yet."

This isn't a theoretical problem for 2040. Kokotajlo points to AI agents as the potential turning point — the moment when AI systems start taking sustained action in the world without human oversight at each step. Agents don't just answer questions. They pursue goals over time, make plans, and execute them. If those goals drift even slightly from what you intended, and the agent is more capable than you at achieving goals, you have a problem you can't easily undo.

The AI race compounds the issue. Companies are sprinting toward AGI and superintelligence while the alignment work lags behind capability development. Kokotajlo's current work through the AI Futures Project focuses on what governments and companies can do now to reduce the risk of losing control. The implicit argument is that waiting until we have superintelligence to figure out alignment is waiting too long.

The Implication

If you're building with AI agents, you're working in the exact zone Kokotajlo is worried about. Every agent you ship that makes autonomous decisions is a small-scale version of the alignment problem. Test it hard. Understand its failure modes. Don't deploy it into critical systems assuming it will stay aligned with your intent just because it did during development.

For everyone else, this is your reminder that the people building the future don't have all the answers yet. They're not hiding a secret plan. By Kokotajlo's account, a good plan doesn't exist yet. Watch how companies approach AI safety and alignment in practice, not just in blog posts. The gap between capability and control is the story of the next decade.

Sources

Business Insider Tech
