While everyone else is shopping at the Nvidia store, Google has spent two years growing its own groceries.
The Summary
- Google unveiled its eighth-generation Tensor Processing Units (TPUs) with a split architecture: TPU v8t for training, TPU v8i for inference and agentic workloads.
- The decision to build two specialized chips came in 2024, before reasoning models and agents became the industry consensus.
- Google's vertical integration across silicon, infrastructure, and software is showing up in cost-per-token economics competitors can't match.
The Signal
Google made the call to split its TPU roadmap in 2024. That timing matters. This was before OpenAI's o1, before reasoning models became table stakes, before every startup pitch deck included the word "agentic." While the rest of the industry was still treating training and inference as variations on the same compute problem, Google was already designing different silicon for each.
The v8t handles frontier model training. The v8i handles low-latency inference and the memory-intensive sampling patterns that agents demand. Different thermal profiles, different memory hierarchies, different trade-offs. This isn't just product differentiation. It's a structural bet that the AI workload is bifurcating permanently.
"One chip a year wasn't enough. This is our first shot at actually going with two super high-powered specialized chips."
Here's why the timing is revealing. In 2024, most labs were still spending 80-90% of their compute budget on pre-training. Inference was an afterthought, something you threw on cheaper chips once the real work was done. Google saw agents coming before the market did. More importantly, they saw that agents wouldn't just need more inference compute. They'd need different inference compute. Longer context windows, multi-turn reasoning, real-time tool use. None of that runs efficiently on hardware optimized for batch gradient descent.
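A rough way to see the gap: count how many floating-point operations a chip gets per byte of weights it reads. The sketch below uses hypothetical sizes (any hidden dimension works), but the shape of the result is the point: a fat training batch is compute-bound, a batch-of-one agent decode step is memory-bandwidth-bound.

```python
# Arithmetic intensity: FLOPs per byte of weight traffic for one d_model x d_model matmul.
# Hypothetical sizes; bf16 weights assumed (2 bytes per parameter).

def flops_per_byte(batch_tokens: int, d_model: int, bytes_per_param: int = 2) -> float:
    flops = 2 * batch_tokens * d_model * d_model        # one multiply + one add per weight, per token
    weight_bytes = d_model * d_model * bytes_per_param  # weights streamed from memory once per pass
    return flops / weight_bytes

d = 8192  # hypothetical hidden size
print("training step, 16,384-token batch:", flops_per_byte(16_384, d))  # compute-bound territory
print("agent decode, 1 token at a time:  ", flops_per_byte(1, d))       # bandwidth-bound
```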
The economics here are what Google wants you to notice. Amin Vahdat, Google's SVP for AI infrastructure, spent his stage time talking about cost-per-token, not teraflops. That's a tell. When you control the whole stack, silicon to API, you can optimize for the metric that actually matters to customers. Nvidia sells chips. Google sells inference at scale. Those are different games with different cost structures.
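The metric itself is simple arithmetic. The figures in this sketch are invented, not Google's, but they show why owning every term in the ratio matters: cheaper silicon shrinks the numerator, better compilers and schedulers grow the denominator.

```python
# Cost per million output tokens for one accelerator. All inputs are illustrative.

def cost_per_million_tokens(chip_cost_per_hour: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return chip_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical inputs: $2.50/hr amortized silicon, 4,000 tok/s sustained, 60% utilization.
print(f"${cost_per_million_tokens(2.50, 4000, 0.60):.3f} per 1M tokens")
```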
Key advantages of vertical integration:
- Custom interconnects between chips, no generic PCIe bottlenecks
- Software compilers tuned to exact silicon capabilities
- Workload schedulers that know which chip is cheaper for which job (sketched below)
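Here's a toy version of that third point. The chip names echo the article, but the prices and the routing rule are assumptions, not anything Google has published.

```python
# Toy workload router: send each job to the part that serves it more cheaply.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    kind: str  # "training" or "inference"

PRICE_PER_HOUR = {"v8t": 12.00, "v8i": 3.50}  # hypothetical list prices

def route(job: Job) -> str:
    """Training goes to the training part; latency-sensitive serving goes to the inference part."""
    return "v8t" if job.kind == "training" else "v8i"

for job in [Job("finetune-70b", "training"), Job("agent-session", "inference")]:
    chip = route(job)
    print(f"{job.name} -> {chip} (${PRICE_PER_HOUR[chip]:.2f}/hr)")
```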
Most cloud providers buy Nvidia H100s and rent out access. They pay Nvidia's gross margins, then add their own. Google doesn't. They build their own silicon, pay only the manufacturing cost, which is lower, and pass some of that arbitrage to customers. The rest becomes margin that compounds with every token served.
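As a back-of-envelope normalization (every figure below is an assumed placeholder, not a reported number), the margin stack looks roughly like this:

```python
# Normalize the chip's bill of materials to 1.0 and stack the margins.

bom = 1.0
nvidia_gross_margin = 0.75   # assumed: the buyer pays roughly 4x the BOM
cloud_markup = 0.30          # assumed: the provider's own margin on top

rented_gpu = bom / (1 - nvidia_gross_margin) * (1 + cloud_markup)
in_house_tpu = bom * 1.2     # assumed: Google pays ~20% over its own BOM

print(f"rented GPU:   {rented_gpu:.1f}x BOM")    # ~5.2x
print(f"in-house TPU: {in_house_tpu:.1f}x BOM")  # 1.2x
```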
The Implication
If you're building agents that need to run millions of inferences per day, the cost structure of your compute matters more than the headline spec. Google's bet is that enterprises will route training to one chip and inference to another, letting the economics guide the architecture. That works if you're on Google Cloud. It doesn't work if you're renting generic GPUs from three different providers and duct-taping them together.
Watch for more hyperscalers to follow this split-architecture path. Microsoft, Amazon, and Meta all have custom silicon projects in flight. The Nvidia tax only makes sense when you're buying commodity chips for commodity workloads. Agents aren't commodity workloads anymore.