Microsoft introduced Maia 200 in January, an inference accelerator built on TSMC’s 3nm process with over 140 billion transistors. The chip delivers 10 petaFLOPS in 4-bit precision (FP4) and over 5 petaFLOPS in 8-bit (FP8), all within a 750W thermal design power envelope. Scott Guthrie, Microsoft’s Executive Vice President of Cloud + AI, calls it “the most performant, first-party silicon from any hyperscaler.”
The numbers matter. Maia 200 claims three times the FP4 performance of Amazon’s third-generation Trainium and FP8 performance above Google’s seventh-generation TPU. But the more telling metric is the 30% improvement in performance per dollar compared to the latest-generation hardware in Microsoft’s current fleet. That is the number that changes the calculus for Azure customers running inference workloads at scale.
Maia 200 is already deployed in Microsoft’s US Central datacenter region near Des Moines, Iowa, with US West 3 near Phoenix, Arizona, coming next. It is serving GPT-5.2 models from OpenAI, powering Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team is using it for synthetic data generation and reinforcement learning.
Designed for the inference bottleneck
The chip’s architecture reveals a clear thesis: inference, not training, is where the hyperscaler wars will be won. Maia 200 packs 216GB of HBM3e memory at 7 TB/s bandwidth, plus 272MB of on-chip SRAM. The memory subsystem centers on narrow-precision datatypes, a specialized DMA engine, and a custom network-on-chip fabric for high-bandwidth data movement.
These choices reflect a specific understanding of the inference workload. Large language models are memory-bound during generation: each new token requires streaming the model weights and key-value cache from memory, and the compute units can process that data faster than memory can deliver it. Maia 200 attacks this by keeping more data on-chip and moving the rest through a redesigned memory hierarchy.
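A rough roofline-style calculation shows why generation is memory-bound. The sketch below uses the bandwidth, capacity, and FP4 figures quoted in this article. The 70-billion-parameter model, the batch size of one, and the simplification that each token streams every weight from HBM exactly once (ignoring the key-value cache and on-chip SRAM reuse) are illustrative assumptions, not Microsoft data.

```python
# Back-of-the-envelope roofline numbers for single-stream decoding.
# Hardware figures come from the article; the model size and the
# "weights only, batch size 1" simplification are assumptions.

HBM_BANDWIDTH_BYTES_PER_S = 7e12   # 7 TB/s HBM3e bandwidth
HBM_CAPACITY_BYTES = 216e9         # 216 GB HBM3e capacity
PEAK_FP4_FLOPS = 10e15             # 10 petaFLOPS at FP4


def ridge_point(peak_flops, bandwidth):
    """FLOPs the chip can execute per byte it reads from memory.
    Workloads with lower arithmetic intensity are memory-bound."""
    return peak_flops / bandwidth


def weight_bytes(num_params, bits_per_param):
    """Size of the model weights at a given precision."""
    return num_params * bits_per_param / 8


def decode_intensity(bits_per_param):
    """Approximate FLOPs per byte for batch-1 decoding: about 2 FLOPs
    (one multiply, one add) per parameter per generated token."""
    return 2 / (bits_per_param / 8)


def max_decode_tokens_per_s(num_params, bits_per_param, bandwidth):
    """Upper bound on tokens/s when every token streams all weights once."""
    return bandwidth / weight_bytes(num_params, bits_per_param)


if __name__ == "__main__":
    params = 70e9  # hypothetical 70B-parameter model
    print(f"ridge point: {ridge_point(PEAK_FP4_FLOPS, HBM_BANDWIDTH_BYTES_PER_S):.0f} FLOP/byte")
    for bits in (4, 8):
        size = weight_bytes(params, bits)
        rate = max_decode_tokens_per_s(params, bits, HBM_BANDWIDTH_BYTES_PER_S)
        print(f"{bits}-bit: {size / 1e9:.0f} GB of weights, "
              f"fits in HBM: {size <= HBM_CAPACITY_BYTES}, "
              f"intensity {decode_intensity(bits):.0f} FLOP/byte, "
              f"at most {rate:.0f} tokens/s")
```

Under these assumptions the chip could execute roughly 1,429 FLOPs for every byte it reads, while batch-1 decoding at FP4 needs only about 4. Memory bandwidth, not compute, sets the ceiling, which is why narrower datatypes and larger on-chip storage pay off directly in tokens per second.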
The scale-up network uses a two-tier design that runs on standard Ethernet rather than a proprietary fabric. Each accelerator exposes 2.8 TB/s of dedicated bidirectional scale-up bandwidth. Clusters of up to 6,144 accelerators can perform collective operations with predictable performance. Within each tray, four Maia accelerators connect to one another over direct, non-switched links, keeping high-bandwidth communication local.
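To put the cluster figures in perspective, the sketch below derives the tray count and aggregate HBM capacity from the numbers above, then estimates the bandwidth term of an all-reduce within one tray. The ring all-reduce model, the even 1.4 TB/s per-direction split of the 2.8 TB/s figure, and the 1 GB message size are assumptions made for illustration. The article does not describe which collective algorithms Maia uses, and the estimate ignores latency and protocol overhead.

```python
# Cluster arithmetic from the article's figures, plus a simplified
# ring all-reduce estimate. The ring model and the per-direction
# bandwidth split are assumptions, not documented Maia behavior.

ACCELERATORS_PER_CLUSTER = 6144
ACCELERATORS_PER_TRAY = 4
HBM_PER_ACCELERATOR_GB = 216
SCALE_UP_BIDIRECTIONAL_BYTES_PER_S = 2.8e12


def tray_count(accelerators, per_tray):
    """Number of trays needed to hold the given accelerators."""
    return accelerators // per_tray


def aggregate_hbm_tb(accelerators, hbm_gb):
    """Total HBM across the cluster, in decimal terabytes."""
    return accelerators * hbm_gb / 1000


def ring_allreduce_seconds(num_devices, message_bytes, send_bandwidth):
    """Bandwidth term of a ring all-reduce: each device sends
    2 * (N - 1) / N times the message size over its outgoing link."""
    return 2 * (num_devices - 1) / num_devices * message_bytes / send_bandwidth


if __name__ == "__main__":
    print(tray_count(ACCELERATORS_PER_CLUSTER, ACCELERATORS_PER_TRAY))       # 1536 trays
    print(aggregate_hbm_tb(ACCELERATORS_PER_CLUSTER, HBM_PER_ACCELERATOR_GB))  # about 1327 TB

    send_bw = SCALE_UP_BIDIRECTIONAL_BYTES_PER_S / 2  # assumed per-direction share
    t = ring_allreduce_seconds(ACCELERATORS_PER_TRAY, 1e9, send_bw)
    print(f"{t * 1e3:.2f} ms")  # about 1.07 ms for a 1 GB all-reduce in one tray
```

A full cluster therefore spans 1,536 trays and about 1.3 PB of HBM in total.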
The software story matters more than the silicon
Microsoft is previewing the Maia SDK, which includes PyTorch integration, a Triton compiler, an optimized kernel library, and a low-level programming language called NPL. The SDK also ships a Maia simulator and cost calculator for developers to optimize workloads before deploying.
The software stack is the part most likely to determine whether Maia 200 succeeds or becomes another custom chip that only Microsoft’s internal teams can use effectively. Google’s TPU succeeded in part because of the XLA compiler and JAX integration. Amazon’s Trainium and Inferentia have struggled with developer adoption despite competitive hardware specs.
Microsoft’s bet on Triton is strategic. Triton, originally developed by OpenAI and released as open source, lets developers write custom GPU kernels in a Python-like language without writing CUDA. By supporting Triton on Maia 200, Microsoft signals that it wants to reduce the friction of porting models away from NVIDIA’s ecosystem. The same Triton kernels that run on H100s and B200s should, in theory, run on Maia 200 with minimal changes.
What Maia 200 means for the AI supply chain
The chip arrives at a moment when the AI industry is acutely sensitive to hardware dependency. NVIDIA holds an estimated 80%+ of the AI accelerator market. Every major cloud provider has invested in custom silicon as a hedge: Google with TPU, Amazon with Trainium and Inferentia, Microsoft with Maia.
Maia 200’s 3nm process node is a step ahead of the 4nm-class TSMC process behind NVIDIA’s Blackwell architecture, though the node alone does not settle performance. The 750W TDP per chip is aggressive but manageable with Microsoft’s second-generation closed-loop liquid cooling, which Microsoft developed in tandem with the chip program. The time from first silicon to first datacenter rack deployment was less than half that of comparable AI infrastructure programs, according to Microsoft.
That speed is a function of Microsoft’s pre-silicon validation environment, which modeled computation and communication patterns of LLMs with high fidelity before the chip was fabricated. AI models were running on Maia 200 silicon within days of first packaged part arrival.
The open question: adoption beyond Microsoft
Maia 200 is not for sale as a standalone chip. It is an Azure resource, accessible through Microsoft’s cloud infrastructure. Developers who want to use it must port their models to the Maia SDK and run them in Microsoft’s datacenters.
This creates a familiar tension. Custom silicon gives Microsoft better margins on inference and reduces dependency on NVIDIA. But it also creates a lock-in dynamic that some developers will resist. The Triton compiler integration helps, but the real test is whether the performance-per-dollar advantage is large enough to overcome the switching cost.
For now, Maia 200 is a signal. Microsoft is building the infrastructure for a world where inference costs determine which AI applications are viable. The chip’s 30% performance-per-dollar advantage over current fleet hardware means that models running on Maia 200 can either serve about 30% more tokens for the same cost or serve the same number of tokens for about 23% less.
That is the kind of improvement that changes product decisions. Cutting inference cost by nearly a quarter can turn a money-losing AI feature into a profitable one. It can make real-time generation feasible where batch processing was the only option. It can shift the break-even point for AI startups that burn capital on API calls.
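A performance-per-dollar gain and a cost reduction are not the same percentage, and the conversion is easy to check. The sketch below assumes the 30% figure means 1.3 times as many tokens per dollar and that any savings are passed through in full. The function names are illustrative.

```python
# Converting between a performance-per-dollar gain and the resulting
# cost reduction for a fixed amount of work. Assumes "30% better
# performance per dollar" means 1.3x the tokens per dollar.

def cost_reduction_from_gain(perf_per_dollar_gain):
    """Fractional cost saved for the same work, given a fractional
    performance-per-dollar gain (0.30 for 30%)."""
    return 1 - 1 / (1 + perf_per_dollar_gain)


def gain_needed_for_cost_reduction(cost_reduction):
    """Fractional performance-per-dollar gain required to cut the
    cost of the same work by the given fraction."""
    return 1 / (1 - cost_reduction) - 1


if __name__ == "__main__":
    print(f"{cost_reduction_from_gain(0.30):.1%}")        # 23.1%
    print(f"{gain_needed_for_cost_reduction(0.30):.1%}")  # 42.9%
```

A 30% performance-per-dollar gain lowers the cost of a fixed workload by about 23%. Cutting cost by a full 30% would take a gain of roughly 43%.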
Microsoft is not just building a faster chip. It is building an economic argument for putting more AI workloads on Azure. The chip is the mechanism. The margin is the message.