The companies building the agents that might run your life can't reliably tell you who's winning an election.
The Summary
- All four major AI chatbots (ChatGPT, Gemini, Claude, and Grok) fail to accurately answer questions about elections and geopolitics, according to a new Forum AI study
- Forum AI CEO Campbell Brown says AI companies need to stop "grading their own homework" and submit to independent evaluation
- The findings expose a critical gap: models trained to sound confident systematically fail at the most sensitive information tasks
The Signal
Forum AI tested the four dominant chatbots on questions about elections and geopolitics and found them broadly unreliable. Not just wrong occasionally, but systematically unable to handle factual questions about current events. This isn't an edge case. Elections and news are exactly the kinds of queries millions of people already bring to these tools.
The companies behind these models, OpenAI, Google, Anthropic, and xAI, have spent the last two years telling us their systems are ready for real-world deployment. They've pushed them into search, email, customer service, coding assistants. But when an independent organization stress-tests them on verifiable facts, the systems buckle.
"AI companies will have to be different and stop grading their own homework."
Campbell Brown's criticism cuts to the structural problem. Every AI lab publishes its own benchmarks, runs its own evals, declares its own progress. There's no FDA for language models. No independent body that certifies a chatbot is fit for purpose before it goes to a billion users. The industry has been self-regulating, which in practice means not regulating.
Forum AI's study matters because it's an external evaluation, not a vendor claim. It measured performance on a task that matters: Can this tool give me accurate information about the world as it exists right now? The answer, across all four major players, was no.
Key failure points:
- Hallucinated election results or timelines
- Mixed up geopolitical facts or presented outdated information as current
- Failed to distinguish between verified reporting and speculation
This isn't a training data problem you can fix with more scraping. It's an architecture problem. On their own, these models don't look facts up. They predict the next plausible token based on patterns in training data that ends at a fixed cutoff. When the pattern fits but the fact doesn't, the model still sounds confident. Nothing in that process reliably flags uncertainty or makes the model say "I don't actually know this."
The timing matters. We're eighteen months into the agent economy. Companies are deploying AI to handle customer inquiries, draft communications, summarize documents, research competitors. If the underlying models can't accurately answer a question about an election, what happens when they're tasked with interpreting regulatory filings or market-moving news?
The Implication
If you're building with these models, assume they will confidently lie to you about current events. Don't route news queries, political questions, or time-sensitive factual lookups through LLMs without a retrieval layer that pulls from verified, timestamped sources. The chatbot interface makes users trust the output. That trust is misplaced.
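What that retrieval gate might look like, as a minimal Python sketch. Everything named here is an assumption for illustration: the keyword heuristic, the `Source` record, the 24-hour freshness window, and the `route` function are hypothetical, not taken from the Forum AI study or any vendor's API. The idea is simply that a time-sensitive question either gets answered from verified, recent sources or doesn't get answered at all.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical keyword heuristic. A production system would more likely use a
# trained classifier; this only illustrates the routing decision.
TIME_SENSITIVE = re.compile(
    r"\b(?:election|vote|poll|ballot|ceasefire|sanction|latest|breaking|today)s?\b",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Source:
    """A retrieved document (hypothetical shape for this sketch)."""
    publisher: str
    text: str
    published_at: datetime  # must be timezone-aware
    verified: bool          # set by whatever vetting process you trust


def is_time_sensitive(query: str) -> bool:
    return bool(TIME_SENSITIVE.search(query))


def fresh_verified(sources, now, max_age):
    """Keep only verified sources published within max_age of now."""
    return [
        s for s in sources
        if s.verified and timedelta(0) <= now - s.published_at <= max_age
    ]


def route(query, sources, now, max_age=timedelta(hours=24)):
    """Decide how a query should be handled before any model is called."""
    if not is_time_sensitive(query):
        return {"action": "model_only", "context": []}
    usable = fresh_verified(sources, now, max_age)
    if not usable:
        # No trustworthy, recent evidence: decline rather than guess.
        return {"action": "abstain", "context": []}
    usable.sort(key=lambda s: s.published_at, reverse=True)
    return {"action": "answer_with_sources", "context": usable}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    docs = [Source("Example Wire", "Count is ongoing.", now - timedelta(hours=3), True)]
    print(route("Who won the election?", docs, now)["action"])
```

The design choice that matters is the `abstain` branch. If nothing verified and recent comes back, the system should say so instead of letting the model fill the gap with a plausible guess.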
For the AI companies, this is a credibility test. If independent audits keep showing your systems fail basic accuracy checks, the "move fast" excuse stops working. The question isn't whether regulation is coming. It's whether the industry builds trustworthy eval infrastructure before governments do it for them.