AMD laid out its AI accelerator roadmap through 2027 at its Advancing AI event in San Jose, and the message was unambiguous: the company intends to compete for the data center GPU crown, not just the second-place slot. The Storage Review report on the event details a generational leap in the Instinct MI350 series and a preview of the MI400 and MI500 platforms that follow. For an AI industry increasingly worried about single-vendor dependency in compute, AMD’s timeline matters.
The MI350 series, shipping now, is the first product built on AMD’s CDNA4 architecture. It represents a fundamental re-engineering of AMD’s AI accelerator approach, moving the compute chiplets from the MI300’s 5nm node to TSMC’s 3nm-class N3P process. The GPU package contains eight Accelerator Complex Dies (XCDs), each with 32 CDNA4 compute units, for a total of 256 CUs. That is a reduction from the MI300X’s 304 CUs, but the tradeoff is deliberate: each CU now has access to more memory bandwidth per clock, and the per-CU throughput for 16-bit and 8-bit operations has doubled.
The memory subsystem is where AMD makes its most aggressive play. Eight stacks of HBM3E deliver 288GB of capacity per GPU, with each 36GB stack running at 8Gbps per pin. That is a 50% capacity increase over the MI300X’s 192GB, and AMD claims approximately 1.3 times higher memory bandwidth per watt than the previous generation. For AI workloads, where model weights and KV caches consume memory at an accelerating rate, 288GB per accelerator is a meaningful threshold. It enables larger models to fit on fewer GPUs, reducing the communication overhead that plagues distributed training and inference.
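The capacity and bandwidth figures above can be checked with some quick arithmetic. The sketch below uses the stack count, stack size, and pin rate quoted here; the 1024-bit interface width per stack is an assumption taken from the HBM3E standard rather than from the report, so the resulting peak bandwidth should be read as an estimate.

```python
# Back-of-the-envelope HBM3E figures for one MI350-series GPU.
STACKS = 8                 # HBM3E stacks per GPU (from the article)
GB_PER_STACK = 36          # capacity per stack (from the article)
PIN_RATE_GBPS = 8          # Gbit/s per pin (from the article)
BUS_WIDTH_BITS = 1024      # ASSUMPTION: standard HBM3E interface width per stack
MI300X_CAPACITY_GB = 192   # previous generation (from the article)

capacity_gb = STACKS * GB_PER_STACK
capacity_gain = capacity_gb / MI300X_CAPACITY_GB - 1

# Bits per second across the whole interface, converted to bytes.
per_stack_gbs = PIN_RATE_GBPS * BUS_WIDTH_BITS / 8
peak_bandwidth_tbs = STACKS * per_stack_gbs / 1000

print(f"capacity: {capacity_gb} GB ({capacity_gain:.0%} over MI300X)")
print(f"estimated peak bandwidth: {peak_bandwidth_tbs:.2f} TB/s")
```

Under these assumptions the eight stacks work out to roughly 8 TB/s of peak bandwidth alongside the 288GB of capacity.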
The CDNA4 compute bet
AMD’s architectural choices in CDNA4 reveal a clear thesis about where AI compute is heading. The matrix engines now support the FP6 and FP4 microscaling (MX) formats defined by the Open Compute Project (OCP), and critically, AMD’s implementation runs FP6 at the same computational rate as FP4. This is not a trivial engineering detail. FP6 is an emerging format that offers a better accuracy-to-bit-width tradeoff than FP4 for many training workloads. By making FP6 first-class, AMD positions itself to capture workloads that require more precision than FP4 can provide but cannot afford the memory and bandwidth costs of FP8.
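To make the memory side of that tradeoff concrete, here is a rough sketch of weight storage under the MX formats. It assumes the layout in the OCP MX specification, where each block of 32 elements shares one 8-bit scale. The 300-billion-parameter model is a hypothetical example, not a figure from the event, and the estimate ignores KV caches, activations, and runtime overhead.

```python
# Rough weight-storage estimate for OCP MX formats on a single 288 GB GPU.
MX_BLOCK_SIZE = 32     # elements sharing one scale (per the OCP MX spec)
MX_SCALE_BITS = 8      # shared scale width (per the OCP MX spec)
GPU_MEMORY_GB = 288    # MI350-series capacity (from the article)


def mx_bits_per_element(element_bits: int) -> float:
    """Effective bits per value once the shared scale is amortized."""
    return element_bits + MX_SCALE_BITS / MX_BLOCK_SIZE


def weight_footprint_gb(params_billions: float, element_bits: int) -> float:
    """Weight storage in GB (10^9 bytes); weights only, nothing else."""
    total_bits = params_billions * 1e9 * mx_bits_per_element(element_bits)
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    HYPOTHETICAL_PARAMS_B = 300  # illustrative model size, not from the source
    for name, bits in (("MXFP8", 8), ("MXFP6", 6), ("MXFP4", 4)):
        gb = weight_footprint_gb(HYPOTHETICAL_PARAMS_B, bits)
        verdict = "fits" if gb <= GPU_MEMORY_GB else "does not fit"
        print(f"{name}: {gb:.1f} GB of weights, {verdict} in {GPU_MEMORY_GB} GB")
```

For this hypothetical model, the FP8 weights alone exceed one GPU’s memory while the FP6 weights fit, which is the gap FP6 is meant to fill.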
The MI350 also introduces hardware-supported stochastic rounding, which mitigates bias when downcasting from FP32 to lower-precision formats. This matters for training stability, especially as models push into lower-bit representations. A new vector ALU supports 2-bit operations and can accumulate BF16 results into FP32, further expanding the precision flexibility.
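The idea behind stochastic rounding can be shown in a few lines of software. This is a minimal illustration on a uniform grid, not a model of AMD’s hardware, which operates on floating-point mantissa bits. The function names and the 0.25 grid step are chosen for the example.

```python
import math
import random


def round_nearest(x: float, step: float) -> float:
    """Deterministic round-to-nearest onto a grid with spacing `step`."""
    return round(x / step) * step


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round up with probability equal to the fractional distance to the
    next grid point, so the expected result equals x."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


if __name__ == "__main__":
    rng = random.Random(0)
    x, step, n = 0.3, 0.25, 100_000
    nearest = round_nearest(x, step)
    mean_sr = sum(stochastic_round(x, step, rng) for _ in range(n)) / n
    print(f"round-to-nearest always gives {nearest}")
    print(f"stochastic rounding averages {mean_sr:.4f} over {n} samples")
```

Round-to-nearest maps every 0.3 to 0.25, so the error accumulates in one direction. Stochastic rounding returns 0.25 or 0.5 with probabilities that keep the average close to 0.3, which is why it helps when many small gradient updates are summed in low precision.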
AMD has also increased the Local Data Share (LDS) size per CU and enhanced transcendental function throughput to keep pace with the matrix core improvements. These are the kinds of micro-architectural details that separate a serious AI accelerator from a repurposed graphics card. The company is clearly designing for the attention-heavy, softmax-intensive operations that dominate modern transformer workloads.
Partitioning for the cloud
One of the more strategically important features of the MI350 series is its enhanced partitioning capability. The GPU supports two NUMA modes: NPS1, which treats the entire 288GB as a single memory domain, and NPS2, which divides memory across the two IO Dies. On the compute side, the GPU can be split into one, two, four, or eight independent partitions, configurable through SR-IOV for virtualized environments.
This is a direct response to cloud providers’ demand for multi-tenant GPU deployments. A single MI350 can serve multiple customers or multiple workloads simultaneously, with hardware-enforced isolation. AMD chose not to support NPS4 mode on the MI350, arguing that the tighter memory coupling within each IOD diminishes the benefits of further subdivision. That is a defensible engineering tradeoff, but it limits flexibility for certain fine-grained multi-tenant scenarios.
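The partitioning options described above reduce to simple division. The sketch below assumes XCDs and memory are split evenly, which the report implies but does not state outright. It also does not model which pairings of compute partition count and NPS mode are actually permitted, since the source does not say. The function and field names are illustrative and do not correspond to any AMD tool or API.

```python
# Illustrative resource math for MI350-series partitioning (not an AMD API).
TOTAL_XCDS = 8
CUS_PER_XCD = 32
TOTAL_MEMORY_GB = 288
COMPUTE_PARTITIONS = (1, 2, 4, 8)     # partition counts named in the article
NPS_MODES = {"NPS1": 1, "NPS2": 2}    # NPS4 is not supported on the MI350


def describe_partitioning(compute_partitions: int, nps_mode: str) -> dict:
    """Return per-partition resources, ASSUMING an even split."""
    if compute_partitions not in COMPUTE_PARTITIONS:
        raise ValueError(f"unsupported partition count: {compute_partitions}")
    if nps_mode not in NPS_MODES:
        raise ValueError(f"unsupported NUMA mode: {nps_mode}")
    xcds = TOTAL_XCDS // compute_partitions
    domains = NPS_MODES[nps_mode]
    return {
        "xcds_per_partition": xcds,
        "cus_per_partition": xcds * CUS_PER_XCD,
        "memory_domains": domains,
        "gb_per_memory_domain": TOTAL_MEMORY_GB // domains,
    }


if __name__ == "__main__":
    for count in COMPUTE_PARTITIONS:
        print(count, describe_partitioning(count, "NPS1"))
    print(2, describe_partitioning(2, "NPS2"))
```

At the finest granularity each tenant gets a single XCD with 32 CUs, while NPS2 yields two 144GB memory domains, one per IO Die.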
The MI350 series ships in two variants: the MI350X at up to 1kW TDP for air-cooled deployments, and the MI355X at up to 1.4kW TDP for liquid-cooled environments. AMD claims the MI355X delivers approximately 20% higher performance in real-world workloads due to higher sustained clock frequencies. In other words, a 40% higher power ceiling buys roughly 20% more performance, a tradeoff that data center operators will need to weigh when planning their cooling infrastructure.
The roadmap beyond
AMD previewed the MI400 and MI500 platforms, though details remain sparse. The MI400, expected in 2026, will reportedly introduce a new architecture built on even more advanced packaging. The MI500, targeting 2027, is positioned as AMD’s answer to Nvidia’s next-generation architectures. The cadence is aggressive: annual or near-annual major architecture releases, a pace that AMD has historically struggled to maintain.
The company also provided deeper details on its Pensando Pollara networking, a P4-programmable AI NIC designed to address the networking bottlenecks that emerge at scale. The Pollara 400 AI NIC features a programmable MPU core that can adapt to new protocols and transport mechanisms through software updates. For large-scale AI clusters, where communication overhead can become the dominant cost, programmable networking is a differentiator.
What is missing from the roadmap is equally telling. AMD did not announce a direct competitor to Nvidia’s NVLink domain technology, which enables GPU-to-GPU communication at bandwidths far exceeding PCIe. The MI350’s seven Infinity Fabric links per GPU each provide approximately 153.6 GB/s of bidirectional bandwidth, but that still trails Nvidia’s NVLink 5 bandwidth in the Blackwell generation. For the largest training runs, where thousands of GPUs must synchronize gradients, that gap matters.
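The size of that gap can be estimated from the per-link figure. The sketch below multiplies the quoted link bandwidth by the link count and compares the result with 1.8 TB/s, Nvidia’s published per-GPU NVLink 5 figure for Blackwell. That number is not from the report and is included here as an assumption, as is the premise that both figures are quoted on the same basis.

```python
# Aggregate GPU-to-GPU link bandwidth, using the figures quoted in the text.
IF_LINKS_PER_GPU = 7
IF_GBS_PER_LINK = 153.6          # per-link figure from the article
NVLINK5_GBS_PER_GPU = 1800.0     # ASSUMPTION: Nvidia's published Blackwell figure

if_aggregate_gbs = IF_LINKS_PER_GPU * IF_GBS_PER_LINK
nvlink_advantage = NVLINK5_GBS_PER_GPU / if_aggregate_gbs

print(f"Infinity Fabric aggregate: {if_aggregate_gbs:.1f} GB/s per GPU")
print(f"NVLink 5 advantage under these assumptions: {nvlink_advantage:.2f}x")
```

On these numbers the seven links total a little over 1 TB/s per GPU, leaving NVLink 5 with an advantage of well over 1.5x.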
What this means for AI builders
For AI researchers and infrastructure operators, AMD’s roadmap offers something Nvidia cannot: a genuine alternative with competitive specifications and a credible timeline. The 288GB HBM3E capacity is immediately useful for inference workloads on large models, and the FP6 support could become a differentiator as the format gains adoption in training pipelines.
The real test is software. AMD’s ROCm stack has improved dramatically over the past two years, but it still lags CUDA in ecosystem maturity, library support, and developer mindshare. PyTorch support for CDNA4 is expected at launch, but the long tail of custom kernels, inference engines, and optimization libraries will take time to port. AMD’s decision to support OCP MX formats natively may accelerate adoption if major frameworks embrace the standard.
The MI350 series ships now. The MI400 and MI500 are promises on a slide deck. For AMD to close the gap with Nvidia, it needs to deliver on those promises on schedule and ensure the software ecosystem keeps pace. The hardware is competitive. The execution clock is ticking.