Google just threw away a network topology they've used for ten years. That's the story nobody wrote.
TPU 8i replaces the 3D torus with Boardfly -- a high-radix topology that cuts maximum hop count 56% for MoE inference. Google just declared training and inference need different network fabrics.
May 2, 2026
TPU 8t and TPU 8i dropped at Cloud Next four days ago. The coverage has been about the numbers -- 121 exaflops, 2 petabytes of shared memory, 80% better inference price-performance. Those numbers are real and they're large. They're not the interesting thing.
The interesting thing is the Boardfly topology on TPU 8i.
Since TPU v4, every Google training chip has used a 3D torus interconnect. 3D torus is the right topology for training -- you're doing dense all-reduce operations, collective bandwidth is everything, and the torus distributes that bandwidth evenly across a regular 3D grid of chips. Google has been iterating on the 3D torus for four generations. They understand it deeply. It's in every piece of inference and training infrastructure they've built.
TPU 8i doesn't have it.
Boardfly is a high-radix topology inspired by Dragonfly networks, optimized for all-to-all communication patterns. It cuts the maximum network diameter of a 1,024-chip configuration from 16 hops down to 7 -- a 56% reduction. In a 3D torus, the longest path between two chips scales with the cube root of cluster size. A Dragonfly-style design instead packs chips into tightly connected groups and links the groups directly to one another, so the worst-case path grows far more slowly as the cluster gets bigger.
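The hop arithmetic is easy to check on the back of an envelope. A minimal sketch, assuming the 1,024-chip torus is laid out as 8 x 8 x 16 (my assumption, the announcement doesn't give a shape) and taking the 7-hop Boardfly figure as quoted:

```python
from math import prod

def torus_diameter(dims):
    """Worst-case hop count between two chips in a torus with wraparound links."""
    # On a ring of d chips the farthest chip is floor(d / 2) hops away,
    # and the worst case across the torus is the sum over all dimensions.
    return sum(d // 2 for d in dims)

# ASSUMPTION: 8 x 8 x 16 layout for 1,024 chips (not stated in the announcement).
dims = (8, 8, 16)
chips = prod(dims)
torus_hops = torus_diameter(dims)

boardfly_hops = 7  # quoted figure, not derived here
reduction = (torus_hops - boardfly_hops) / torus_hops

print(f"{chips} chips: torus diameter {torus_hops}, Boardfly {boardfly_hops}")
print(f"reduction: {reduction:.4f}")

# Cube-root scaling: a k x k x k torus has k**3 chips and diameter 3 * (k // 2).
for k in (4, 8, 16):
    print(k ** 3, "chips ->", torus_diameter((k, k, k)), "hops")
```

Under that assumed shape the torus lands on the 16 hops quoted, and going from 16 to 7 is the 56% in the headline. The loop at the end shows the cube-root growth: every 8x increase in chips doubles the worst-case path.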
Why does this matter? Because training and inference have different communication patterns, and those patterns have been fighting over the same network topology for years.
Training is dominated by all-reduce. During a backward pass, every GPU or TPU needs to sum its gradient contributions with every other chip and distribute the result. This is a bandwidth-intensive operation where total throughput matters more than any individual hop's latency. The 3D torus is good at this -- regular, predictable, high aggregate bandwidth across the whole mesh.
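To make the all-reduce pattern concrete, here is a textbook ring all-reduce simulated in plain Python. This is a sketch of the general technique, not a claim about how TPU collectives are actually implemented; the function name and the one-number-per-chunk simplification are mine.

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce: every worker ends up with the elementwise sum.

    grads[i] is worker i's gradient, split into n chunks (one number per chunk
    here, for simplicity), where n is the number of workers.
    """
    n = len(grads)
    bufs = [list(g) for g in grads]

    # Phase 1, reduce-scatter: after n - 1 steps, worker i holds the fully
    # summed chunk (i + 1) % n. Each worker only talks to its ring neighbor.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, bufs[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            bufs[(src + 1) % n][chunk] += value

    # Phase 2, all-gather: pass the finished chunks around the ring so that
    # every worker ends up with every summed chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, bufs[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            bufs[(src + 1) % n][chunk] = value

    return bufs

result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Every transfer goes to the next worker on the ring, which is exactly the kind of traffic a torus carries well: neighbor links only, all of them busy at once. The cost is set by how much data each link moves, not by how far apart the farthest two chips are.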
Inference, specifically MoE inference, is dominated by all-to-all. Each token gets routed to a subset of expert FFN layers, potentially on different chips, and the activated expert outputs need to be gathered back. This is latency-sensitive in a way training isn't -- you're on the critical path of a user's request, and each MoE routing hop adds directly to time-to-first-token. The 3D torus is not good at this. Long hop counts in a high-diameter topology show up as latency you cannot hide.
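A toy sketch of why MoE dispatch turns into all-to-all traffic. Everything here is hypothetical: the chip and expert counts are made up, and a uniform random choice stands in for a learned router. The point is only the shape of the communication.

```python
import random

def dispatch_pairs(num_chips, tokens_per_chip, experts_per_chip, top_k, seed=0):
    """Return the set of (source chip, destination chip) pairs that exchange tokens."""
    rng = random.Random(seed)
    num_experts = num_chips * experts_per_chip
    pairs = set()
    for src in range(num_chips):
        for _ in range(tokens_per_chip):
            # Stand-in for a learned router: pick top_k distinct experts at random.
            for expert in rng.sample(range(num_experts), top_k):
                dst = expert // experts_per_chip  # chip that hosts this expert
                if dst != src:
                    pairs.add((src, dst))
    return pairs

# Hypothetical sizes, chosen only for illustration.
NUM_CHIPS = 16
pairs = dispatch_pairs(num_chips=NUM_CHIPS, tokens_per_chip=256,
                       experts_per_chip=4, top_k=2)
possible = NUM_CHIPS * (NUM_CHIPS - 1)
print(f"{len(pairs)} of {possible} chip pairs exchanged tokens")
```

With even a modest batch, nearly every chip ends up sending to nearly every other chip, including the ones farthest away in the fabric. Unlike the ring pattern in all-reduce, there is no way to keep this traffic on neighbor links, so the worst-case hop count sits on the request's critical path.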
Google co-designed Boardfly specifically for this pattern. Lower diameter, lower hop count, faster all-to-all. The 5x reduction in on-chip collective latency they're claiming comes partly from the new Collectives Acceleration Engine on the chip, and partly from the fact that the network fabric is no longer fighting the access pattern.
This is the decision that tells you how seriously Google took the bifurcation. You can put different memory configs on two chips pretty easily. You can give one more SRAM and one more HBM and call them specialized. Designing entirely different network topologies for two chips -- and betting a decade-old topology is wrong for one of the two major workloads you're serving -- is a harder call to make. It means you're committing that training and inference are different enough problems that they need different fabrics, not just different memory systems.
The other detail that didn't get enough coverage: TPU 8i replaces SparseCore with CAE.
SparseCore is Google's custom silicon for embedding table lookups -- the irregular memory access pattern that dominates recommendation models. It's been in TPUs since v4. Google built it for their internal recommendation workloads and kept it in the accelerator family as a general-purpose embedding unit.
TPU 8i removes it. In its place: the Collectives Acceleration Engine, dedicated to offloading reduction and synchronization operations during autoregressive decoding.
This is a specific choice about who TPU 8i is for. Google gave up embedding lookup acceleration on the inference chip to free silicon real estate for the communication primitive that matters in autoregressive generation. Embedding lookup is still important for recommendation models -- but TPU 8t keeps SparseCore for training workloads. The inference chip is optimized for transformer decode. Period. If you're running recommendation, use 8t or Ironwood. If you're running MoE generation at low latency, 8i has hardware you cannot get anywhere else.
I want to sit with something here.
NVIDIA announced AFD -- Attention-FFN Disaggregation -- at GTC in March. The insight: attention and FFN have different hardware affinities, so route them to different chips. The GPU handles attention (memory-capacity-bound), the Groq LPU handles FFN (memory-bandwidth-bound). Two chips, one serving path.
Google announced TPU 8t and 8i at Cloud Next in April. The insight: training and inference have different network topology requirements, different memory subsystem requirements, different on-chip acceleration requirements. So build two chips. One for each.
Meta is shipping MTIA chips on a six-month cadence, specialized for recommendation inference. AWS has Trainium3. Microsoft has Maia. Every company running AI at the scale where hardware economics matter has reached the same conclusion independently: the general-purpose accelerator is the wrong answer for at least one of the workloads they're running.
This is not a coincidence. It is what happens when the inference workload scales large enough that the mismatch between hardware design and workload requirements becomes the largest cost driver. You tolerate the mismatch when the scale is small. At Anthropic's scale -- 3.5 gigawatts of TPU capacity committed starting 2027 -- you cannot afford it. You build the right chip.
The goodput number is the one I keep coming back to.
TPU 8t is engineered to target over 97% goodput. Not 97% utilization. Goodput -- the fraction of compute cycles that produce useful training progress rather than being lost to overhead, failures, recomputation, or idle time from synchronization.
At frontier training scale, every percentage point of goodput is days of wall-clock training time. A 10,000-chip cluster losing 3% to overhead is losing the equivalent of 300 chips running continuously. At the capital cost of a 10,000-chip deployment, 3% overhead is not a footnote. The 97% target is the number that tells you Google is engineering for production reliability at a level that most hardware vendors don't quote because it's hard to hit and hard to verify.
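The arithmetic behind that paragraph, as a quick sketch. The 10,000-chip cluster and the 97% target come from the text above; the 90-day ideal run length and the 90% comparison point are hypothetical figures I picked for illustration.

```python
def chips_lost(cluster_chips, goodput):
    """Chip-equivalents doing no useful work at a given goodput fraction."""
    return cluster_chips * (1.0 - goodput)

def wall_clock_days(ideal_days, goodput):
    """Days a run actually takes if only `goodput` of the cycles make progress."""
    return ideal_days / goodput

lost = chips_lost(10_000, 0.97)
print(f"chip-equivalents lost at 97% goodput: {lost:.0f}")

# ASSUMPTION: a run that would take 90 days at perfect goodput (hypothetical).
for goodput in (0.97, 0.90):
    days = wall_clock_days(90, goodput)
    print(f"goodput {goodput:.0%}: {days:.1f} days")
```

At 97% the hypothetical 90-day run stretches by a bit under three days. At 90% it stretches by ten. That gap is what "every percentage point is days of wall-clock time" looks like in practice.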
Amin Vahdat, Google's SVP for AI infrastructure, said at Cloud Next: "Peak FLOPS is a marketing number. Goodput is what determines whether your training cycles get wasted."
He's right. I've been saying a version of this about inference for months. Nice to see it show up in a hardware announcement.
four days ago google split a chip family they've been iterating for ten years into two.
different topologies. different on-chip accelerators. different memory subsystems. different design partners -- broadcom builds 8t, mediatek builds 8i.
the same workload bifurcation story as NVIDIA's AFD, Meta's MTIA, and every other silicon specialization announcement of the last six months -- but told through network topology.
the 3d torus is wrong for MoE inference.
it took four generations to say that out loud.
boardfly is the tell. you don't throw away a ten-year topology unless you're certain the replacement is better for the workload. that certainty comes from running the inference numbers at anthropic-scale and watching the 3d torus fail to hide the hop latency.
P.S. TorchTPU is in preview. Native PyTorch on TPU without rewriting in JAX. This has been the single largest adoption friction point for TPUs outside of Google's own teams -- the CUDA/PyTorch ecosystem is where 90% of ML engineers live and asking them to rewrite in JAX was a real ask. If TorchTPU ships to GA with reasonable performance parity, the total addressable workload for TPU 8i expands significantly. Watch the benchmarks when they drop. That is the number that tells you whether the software story caught up to the hardware story.