The $26 billion AI coding company is telling everyone to stop measuring the wrong thing.
The Summary
- Cognition CEO Scott Wu says token spend leaderboards are "directionally correct" but rank engineers by the wrong metric: what matters is output, not how much API spend an engineer racks up
- The tokenmaxxing phenomenon, in which engineers are ranked or rewarded by how many tokens they burn, may have been overblown from the start, raising questions about whether enterprises ever really adopted spend-ranking at scale
- Wu argues that if engineers ship 3x more with AI, the compute cost is "clearly worth it," but rewards need to tie to actual delivery
The Signal
Cognition's Scott Wu just said the quiet part out loud: ranking engineers by token spend incentivizes the wrong behavior. The CEO of the company behind Devin, the autonomous AI software engineer that helped push Cognition to a $26 billion valuation in May 2025, told the Founders podcast that while token leaderboards have the spirit right, the execution is broken. "People are like, 'We rank our engineers by how many tokens they're spending,'" Wu said. "Well, let's try and rank people by how much output they're actually producing."
The timing matters. Tech leaders have spent months critiquing tokenmaxxing as wasteful theater. Now one of the most valuable AI coding startups is confirming what many suspected: the leaderboard craze confused activity with results. Wu's not arguing against AI spend. He's arguing against measuring the wrong thing.
"If engineers can ship three times more than they would without AI, it is clearly worth it."
SemiAnalysis raises a bigger question: was widespread tokenmaxxing ever really here? Their enterprise conversations suggest the phenomenon may have been more Twitter discourse than boardroom reality. If true, that means the backlash is landing harder than the original trend ever did.
Here's what Wu gets right: compute is expensive, but output compounds. An engineer who ships 3x more features with AI assistance creates 3x more leverage for the business. At that multiple, the token bill is small next to the value of the added velocity. But that holds only if the velocity produces working software, not just more code.
Key tensions emerging:
- Token spend as a proxy for AI adoption versus actual productivity gains
- How to measure "output" when AI changes what output looks like
- Whether leaderboards ever scaled beyond a handful of loud adopters
The real issue is what Wu hints at but doesn't fully unpack: nobody has figured out how to measure AI-assisted output yet. Lines of code shipped is a garbage metric. Features deployed gets closer, but ignores quality. Customer value delivered is right but hard to attribute. So companies default to what's easy to track, which is API spend. And that's how you get engineers gaming leaderboards instead of building products.
The Implication
If you're running engineering teams, Wu just gave you permission to kill the token leaderboard. Replace it with something harder but real: velocity of shipped features, reduction in bug rates, time from commit to production, customer-facing improvements per sprint. The metric matters less than the principle, which is measuring outcomes, not inputs.
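To make the outcomes-versus-inputs point concrete, here is a minimal sketch of what an output-based report could look like next to a spend ranking. Everything in it is an assumption for illustration: the `ShippedChange` record, the `outcome_metrics` helper, the engineer names, and the dollar figures are hypothetical, not anything Wu or Cognition described, and real data would come from your own deploy and incident tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class ShippedChange:
    """Hypothetical record of one change that reached production."""
    engineer: str
    committed_at: datetime
    deployed_at: datetime
    caused_bug: bool  # True if the change was later linked to a bug


def outcome_metrics(changes, token_spend_usd):
    """Summarize delivery per engineer instead of ranking by raw spend."""
    by_engineer = defaultdict(list)
    for change in changes:
        by_engineer[change.engineer].append(change)

    report = {}
    for engineer, items in by_engineer.items():
        # Hours from commit to production for each shipped change.
        lead_hours = [
            (c.deployed_at - c.committed_at).total_seconds() / 3600
            for c in items
        ]
        shipped = len(items)
        report[engineer] = {
            "shipped": shipped,
            "median_lead_time_hours": median(lead_hours),
            "bug_rate": sum(c.caused_bug for c in items) / shipped,
            # Spend is kept, but only as cost per delivered change.
            "usd_per_shipped_change": token_spend_usd.get(engineer, 0.0) / shipped,
        }
    return report


# Made-up sample data: "ben" spends more, "ana" ships more.
changes = [
    ShippedChange("ana", datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 13), False),
    ShippedChange("ana", datetime(2025, 1, 7, 9), datetime(2025, 1, 7, 15), False),
    ShippedChange("ana", datetime(2025, 1, 8, 9), datetime(2025, 1, 8, 17), True),
    ShippedChange("ben", datetime(2025, 1, 6, 9), datetime(2025, 1, 7, 9), False),
]
token_spend_usd = {"ana": 30.0, "ben": 120.0}

report = outcome_metrics(changes, token_spend_usd)
by_spend = sorted(token_spend_usd, key=token_spend_usd.get, reverse=True)
by_output = sorted(report, key=lambda e: report[e]["shipped"], reverse=True)
```

In this toy data the two rankings disagree: the spend leaderboard puts the heavier spender first, while the output view puts the engineer who shipped more first and still shows the bug rate alongside the count. Token spend stays in the report, but as a cost per delivered change rather than a score.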
For AI tooling companies, this is a warning shot. Enterprises are getting smarter about what they pay for. Selling "AI adoption" as a volume game won't work much longer. The next wave of sales conversations will center on productivity evidence, not token counts. Build the analytics that prove output gains, or watch budget holders get skeptical.