
Google splits its next TPU into two chips, one for training and one for reasoning

Key points

Google split its TPU into two chips: TPU 8t for training, with 121 exaflops per 9,600-chip superpod, and TPU 8i for inference, with triple the on-chip SRAM and 80% better performance per dollar than the prior generation.

Virgo Network fabric connects up to 134,000 TPUs or 80,000 GPUs in a single data center, enabling near-linear scaling for frontier model training.

TorchTPU adds native PyTorch support for TPUs with Eager Mode, removing a major adoption barrier by allowing models to run without code modification.

Google splits its next TPU into two chips — 8t for training, 8i for reasoning — and unveils a data center fabric that stitches a million accelerators together.

Tessera Newsroom · 5 min read · May 31, 2026

Source AI infrastructure at Next '26 | Google Cloud Blog (cloud.google.com)


Google Cloud announced a sweeping expansion of its AI infrastructure portfolio at its Next ’26 conference on April 22, led by the eighth generation of its Tensor Processing Unit. For the first time, Google split the TPU into two distinct chips: TPU 8t for training and TPU 8i for inference and reinforcement learning. The company also detailed the A5X instance powered by NVIDIA Vera Rubin NVL72, a new data center fabric called Virgo Network, and a series of storage and orchestration upgrades.

The move reflects a clear thesis inside Google. The dominant workload pattern is shifting from the single-turn chat query to multi-step agentic workflows, and those workflows impose different demands on the silicon. A training chip needs raw FLOPs and memory bandwidth to chew through data for weeks. A reasoning chip needs ultra-low latency to handle the chain of tool calls, state preservation, and reinforcement learning feedback loops that define agent execution. Google is betting that one architecture cannot serve both roles efficiently.

TPU 8t is the training powerhouse. Each superpod packs 9,600 chips delivering 121 exaflops of compute and two petabytes of shared memory. Google says it can now orchestrate clusters of more than one million TPU chips using Pathways and JAX, shrinking months of training into weeks. The numbers are enormous, but the architectural detail that matters is the inter-chip interconnect (ICI) bandwidth, which doubles from the previous generation and helps large models achieve near-linear scaling.
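To put the superpod figures in per-chip terms, the quoted totals can simply be divided by the chip count. The sketch below does only that arithmetic. It assumes decimal units (1 exaflop = 1,000 petaflops, 1 petabyte = 1,000,000 gigabytes) and an even split across chips; the numeric precision behind the 121-exaflop figure is not stated in the announcement, so the result is not comparable across vendors without that detail.

```python
# Figures quoted for one TPU 8t superpod.
SUPERPOD_CHIPS = 9_600
SUPERPOD_EXAFLOPS = 121          # numeric precision not stated in the source
SUPERPOD_SHARED_MEMORY_PB = 2    # assumed to be decimal petabytes


def per_chip_petaflops(exaflops: float, chips: int) -> float:
    """Average compute per chip in petaflops (1 exaflop = 1,000 petaflops)."""
    return exaflops * 1_000 / chips


def per_chip_memory_gb(petabytes: float, chips: int) -> float:
    """Average share of the pooled memory per chip, in decimal gigabytes."""
    return petabytes * 1_000_000 / chips


if __name__ == "__main__":
    pf = per_chip_petaflops(SUPERPOD_EXAFLOPS, SUPERPOD_CHIPS)
    gb = per_chip_memory_gb(SUPERPOD_SHARED_MEMORY_PB, SUPERPOD_CHIPS)
    print(f"{pf:.1f} petaflops per chip, {gb:.0f} GB of shared memory per chip")
```

Under those assumptions, the superpod works out to roughly 12.6 petaflops and about 208 GB of pooled memory per chip.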

TPU 8i takes the opposite approach. It triples on-chip SRAM to 384 MB and increases HBM to 288 GB, allowing massive KV caches to live entirely on silicon. The ICI bandwidth doubles to 19.2 Tb/s, the network diameter shrinks by over 50%, and a new Collectives Acceleration Engine reduces on-chip latency by up to 5x. Google claims TPU 8i delivers 80% better performance per dollar for inference than the prior generation. That figure matters because inference costs, not training costs, are becoming the binding constraint for deployed AI products.
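The memory figures are easier to judge against the size of a KV cache. A common back-of-envelope estimate for a transformer is two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below applies that formula to a hypothetical model; the layer count, head count, head dimension, and 2-byte element size are illustrative assumptions, not the specification of any Google model, and the calculation ignores the HBM that model weights would also occupy. Capacities are treated as decimal units.

```python
HBM_BYTES = 288 * 10**9    # 288 GB of HBM, assumed decimal
SRAM_BYTES = 384 * 10**6   # 384 MB of on-chip SRAM, assumed decimal


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies: keys plus values, for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


def max_cached_tokens(capacity_bytes: int, bytes_per_token: int) -> int:
    """Whole tokens that fit in the given capacity, ignoring weights and overhead."""
    return capacity_bytes // bytes_per_token


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                                   bytes_per_element=2)
    print("bytes per token:", per_token)
    print("tokens in HBM:  ", max_cached_tokens(HBM_BYTES, per_token))
    print("tokens in SRAM: ", max_cached_tokens(SRAM_BYTES, per_token))
```

For this hypothetical shape, each token costs 327,680 bytes, so 288 GB of HBM holds roughly 879,000 cached tokens while 384 MB of SRAM holds about 1,170. The bulk of a long-context cache would therefore sit in HBM, with SRAM serving the hottest working set.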

The split is a direct response to the economics of Mixture of Experts models and long-context reasoning. MoE models activate only a subset of parameters per token, which makes them efficient for training but latency-sensitive for inference. A chip designed for dense matrix multiplication may stall on the sparse, branching execution paths of an agent. TPU 8i is Google’s attempt to build a chip that does not stall.
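The "subset of parameters per token" idea can be made concrete with generic top-k expert routing, sketched below in plain Python. This is the textbook gating scheme, not a description of how any Google model or TPU implements it; the expert counts and parameter sizes in the example are hypothetical.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)      # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def active_fraction(num_experts: int, k: int,
                    expert_params: float, shared_params: float) -> float:
    """Share of all parameters that a single token actually uses."""
    active = shared_params + k * expert_params
    total = shared_params + num_experts * expert_params
    return active / total


if __name__ == "__main__":
    # One token's router scores over 8 hypothetical experts.
    print(top_k_route([0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.2], k=2))
    # Hypothetical model: 64 experts of 1B parameters each, 2B shared, 2 active.
    print(f"{active_fraction(64, 2, 1e9, 2e9):.1%} of parameters active per token")
```

Because each token may select different experts, a small inference batch can touch many experts' weights while doing little arithmetic on each, which is one reason serving such models tends to be bound by memory and interconnect latency rather than raw FLOPs.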

Google also made a significant concession to the GPU camp. The A5X instance, powered by NVIDIA Vera Rubin NVL72, will be available later this year. Google is co-engineering the open-source Falcon networking protocol with NVIDIA through the Open Compute Project. The message is pragmatic: Google will sell both TPUs and GPUs, and let customers choose. The Virgo Network fabric that connects TPU 8t superpods will also support A5X, scaling up to 80,000 GPUs in a single data center and 960,000 GPUs across multiple sites.

The network story is arguably the most important announcement in the portfolio. Virgo Network uses a collapsed fabric architecture with 4x the bandwidth of the previous generation. Google says it can connect 134,000 TPUs into a single fabric in one data center, and more than one million TPUs across multiple data center sites into a single training cluster. That is the scale at which frontier model training now operates. Without a fabric that can sustain near-linear efficiency at those sizes, the individual chip improvements are irrelevant.
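"Near-linear" has a simple quantitative reading: scaling efficiency is measured throughput divided by the throughput an ideal, perfectly linear cluster would deliver at the same size. The sketch below computes it for made-up measurements; the throughput values are hypothetical and are not figures from Google.

```python
def scaling_efficiency(base_chips: int, base_throughput: float,
                       scaled_chips: int, scaled_throughput: float) -> float:
    """Measured throughput as a fraction of perfect linear scaling."""
    ideal = base_throughput * scaled_chips / base_chips
    return scaled_throughput / ideal


def effective_chips(chips: int, efficiency: float) -> float:
    """Number of perfectly scaling chips that would do the same work."""
    return chips * efficiency


if __name__ == "__main__":
    # Hypothetical: one 9,600-chip superpod delivers 1.0 unit of throughput,
    # and a 134,000-chip fabric delivers 12.5 units.
    eff = scaling_efficiency(9_600, 1.0, 134_000, 12.5)
    print(f"efficiency: {eff:.1%}")
    print(f"effective chips: {effective_chips(134_000, eff):,.0f}")
```

In this made-up case the fabric runs at about 89.6% efficiency, so 134,000 chips do the work of roughly 120,000. The missing 14,000 chips' worth of compute is the cost of imperfect scaling, which is why fabric efficiency matters as much as per-chip gains at this size.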

On the software side, Google announced native PyTorch support for TPUs, called TorchTPU, now in preview with select customers. TorchTPU supports Eager Mode, meaning models can run on TPUs without modification to the PyTorch code. This is a direct response to the criticism that TPUs are locked into Google’s JAX framework. If TorchTPU works as advertised, it removes a major adoption barrier. Google also highlighted its contributions to llm-d, a Kubernetes-native LLM inference framework accepted as a CNCF Sandbox project, alongside Red Hat, IBM Research, CoreWeave, and NVIDIA.

The storage announcements fill out the picture. Google Cloud Managed Lustre now delivers 10 TB/s of bandwidth, a 10x improvement over last year, with capacity up to 80 petabytes. Rapid Buckets on Cloud Storage achieves sub-millisecond latency and 20 million operations per second, which Google says helps maintain 95% accelerator utilization during training checkpoints and recoveries. The Z4M instance, with up to 168 TiB of local SSD per machine and RDMA support, targets customers who want to run their own parallel file systems from vendors like Vast Data or Sycomp.
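The link between storage bandwidth and accelerator utilization can be illustrated with a deliberately simple model: if a checkpoint write blocks training, the stall is checkpoint size divided by bandwidth, and utilization is compute time over compute time plus stall. The checkpoint size and interval below are hypothetical, the 1 TB/s comparison point is only what the quoted 10x improvement implies, and the model ignores asynchronous checkpointing, recoveries, and every other source of idle time.

```python
def checkpoint_stall_seconds(checkpoint_tb: float, bandwidth_tb_per_s: float) -> float:
    """Seconds the accelerators wait while a blocking checkpoint is written."""
    return checkpoint_tb / bandwidth_tb_per_s


def utilization(compute_interval_s: float, stall_s: float) -> float:
    """Fraction of wall-clock time spent computing, with one stall per interval."""
    return compute_interval_s / (compute_interval_s + stall_s)


if __name__ == "__main__":
    checkpoint_tb = 50.0   # hypothetical checkpoint size
    interval_s = 600.0     # hypothetical: checkpoint every 10 minutes of compute
    for label, bandwidth in (("10 TB/s", 10.0), ("1 TB/s", 1.0)):
        stall = checkpoint_stall_seconds(checkpoint_tb, bandwidth)
        print(f"{label}: stall {stall:.0f} s, "
              f"utilization {utilization(interval_s, stall):.1%}")
```

With these assumed inputs, utilization is about 99.2% at 10 TB/s and about 92.3% at 1 TB/s, which shows how a tenfold bandwidth gain can move a cluster from below a 95% target to above it.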


The GKE upgrades are worth noting for anyone deploying agent workloads at scale. Node startup is 4x faster, pod startup time is cut by up to 80%, and model loading is 5x faster using the run:AI Model Streamer and Rapid Cache. The Inference Gateway now uses machine learning for real-time capacity-aware routing, cutting time-to-first-token latency by more than 70% without manual tuning. For voice agents and real-time interactive applications, that latency reduction changes the product design space.
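Capacity-aware routing means sending each request to the replica best able to start it immediately, rather than rotating through replicas blindly. The sketch below is a hand-written heuristic that illustrates the idea only: the `Replica` fields, the 0.9 saturation threshold, and the tie-breaking rule are all assumptions, and Google describes its own gateway as using a machine-learning model rather than fixed rules like these.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queue_depth: int             # requests already waiting on this replica
    kv_cache_utilization: float  # 0.0 (empty) to 1.0 (full)


def pick_replica(replicas: list[Replica],
                 max_kv_utilization: float = 0.9) -> Replica:
    """Prefer replicas with cache headroom, then the shortest queue."""
    with_headroom = [r for r in replicas
                     if r.kv_cache_utilization < max_kv_utilization]
    candidates = with_headroom or replicas   # fall back if all are saturated
    return min(candidates,
               key=lambda r: (r.queue_depth, r.kv_cache_utilization))


if __name__ == "__main__":
    fleet = [
        Replica("a", queue_depth=3, kv_cache_utilization=0.20),
        Replica("b", queue_depth=1, kv_cache_utilization=0.95),
        Replica("c", queue_depth=2, kv_cache_utilization=0.50),
    ]
    print(pick_replica(fleet).name)
```

Here replica "b" has the shortest queue but almost no cache headroom, so the heuristic skips it and picks "c". A naive least-queue policy would have chosen "b" and risked a slow first token while cache space was freed.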

What is missing from the announcement is equally telling. Google did not disclose pricing for TPU 8t or TPU 8i. It did not say when the chips will be generally available, only “soon.” It did not provide independent benchmarks from customers running production agent workloads on the new silicon. The 80% better performance per dollar claim for TPU 8i is against the prior generation, not against competing hardware from NVIDIA or AMD.
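One point about the headline figure is easy to misread: 80% better performance per dollar does not mean inference becomes 80% cheaper. If performance per dollar rises by a factor of 1.8, the cost of a fixed amount of work falls to 1/1.8 of what it was. The sketch below does that conversion; it relies only on the quoted 80% figure and the definition of performance per dollar.

```python
def relative_cost(perf_per_dollar_gain: float) -> float:
    """Cost of a fixed workload relative to the baseline, given a fractional gain."""
    return 1.0 / (1.0 + perf_per_dollar_gain)


def cost_reduction(perf_per_dollar_gain: float) -> float:
    """Fraction by which the cost of a fixed workload falls."""
    return 1.0 - relative_cost(perf_per_dollar_gain)


if __name__ == "__main__":
    gain = 0.80  # "80% better performance per dollar" versus the prior generation
    print(f"relative cost:  {relative_cost(gain):.1%}")
    print(f"cost reduction: {cost_reduction(gain):.1%}")
```

By this reading, the same inference workload would cost about 55.6% of what it did on the prior generation, a reduction of roughly 44%, and only relative to Google's own earlier TPU.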

The broader picture is that Google is building infrastructure for a workload that does not yet exist at scale. Agentic AI is the industry’s declared future, but the majority of deployed AI systems today still operate in the chat paradigm. Google is placing a large bet that the transition will happen fast, and that the companies that build for it first will capture the market. The TPU 8t and TPU 8i split, the Virgo Network fabric, the TorchTPU software layer, and the GKE inference optimizations are all designed for a world where a single user request triggers a fleet of specialized agents that collaborate, preserve state, and learn in real time.

That world may arrive in 2027 or 2028. Google is building the infrastructure for it now. The question for AI builders is whether to bet on the same timeline, or wait until the agent workloads materialize and the hardware prices come down.
