
Wafer hits 2626 tok/s on GLM5.2 with AMD MI355X at over 2x lower cost than Blackwell

Key points

Wafer served GLM5.2 on AMD MI355X at 2626 tok/s/node and 213 tok/s single stream, achieving 80% of B200 performance at over 2x lower cost.

Two trivial fixes, a ROCm header guard and a module prefix mismatch, unblocked a gain of close to 3x in single-stream throughput on AMD's ROCm stack.

AMD's MI355X delivers 80% of Blackwell performance at less than half the cost, shifting inference economics for scale deployments.

Wafer serves GLM5.2 on AMD MI355X at 2626 tok/s/node and 213 tok/s single stream, outperforming Blackwell on cost while requiring no custom kernels.

Tessera Newsroom · 5 min read · July 4, 2026

Source: GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell (wafer.ai)


Wafer published a benchmark on July 3 showing it served GLM5.2 on AMD’s MI355X at 2626 tokens per second per node and 213 tok/s single stream, at over 2x lower cost than an equivalent Blackwell deployment. The result is not a record. It is a signal that AMD’s software gap is closing in real time, driven by the same force that is reshaping the entire inference stack: agent-driven kernel and model optimization.

The headline numbers matter. On a workload of 20,000 input tokens and 1,000 output tokens with a 60% cache hit rate, Wafer hit an aggregate throughput of 2626 tok/s/node at 2.4 requests per second, with a ≤5s time-to-first-token knee. That is about 80% of the performance measured on a B200, despite the MI355X being over 2x cheaper per GPU. On single-stream decode, following Artificial Analysis standards, Wafer reached 213 tok/s on 10k input and 1.5k output tokens, served on MI355X capacity from TensorWave. That number does not top the leaderboard, but it wins on performance per dollar.
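To make the shape of that workload concrete, here is a minimal Python sketch of the prefill arithmetic. It assumes the 60% cache hit rate applies to input tokens, so cached tokens skip prefill; that reading is an assumption, not something the source spells out.

```python
# Workload figures quoted in the article.
INPUT_TOKENS = 20_000
OUTPUT_TOKENS = 1_000
CACHE_HIT_PERCENT = 60       # assumption: applies to input (prefill) tokens
REQUESTS_PER_SECOND = 2.4


def prefill_split(input_tokens: int, cache_hit_percent: int) -> tuple[int, int]:
    """Return (cached, uncached) input tokens for a single request."""
    cached = input_tokens * cache_hit_percent // 100
    return cached, input_tokens - cached


cached, uncached = prefill_split(INPUT_TOKENS, CACHE_HIT_PERCENT)
uncached_per_second = uncached * REQUESTS_PER_SECOND

print(f"cached input tokens per request:   {cached}")
print(f"uncached input tokens per request: {uncached}")
print(f"uncached prefill tokens per second at {REQUESTS_PER_SECOND} RPS: "
      f"{uncached_per_second:.0f}")
print(f"uncached input tokens per output token: {uncached / OUTPUT_TOKENS:.0f}")
```

Under that assumption each request still carries several times more uncached input tokens than output tokens, so the workload is dominated by prefill.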

The cost math is stark. Wafer notes that AMD’s MI355X is roughly 2.75x cheaper per GPU than NVIDIA’s B300, with comparable hardware specs. NVIDIA GPU prices are climbing as demand for inference outpaces supply, driven by the cadence of frontier model releases (Claude Fable, GLM5.2, Minimax M3) that Wafer calls “the token craze.” The Blackwell supply crunch is real, and it is making tokens expensive. AMD’s Instinct MI350 series competes at the silicon level, but NVIDIA’s software advantage and day-0 support have historically let providers serve inference much faster on NVIDIA hardware with much less friction. On the MI355X and ROCm stack, SOTA performance rarely comes out of the box.

Wafer’s engineers had to work for it. They quantized the base bf16 GLM5.2 to MXFP4 using AMD’s Quark tool, and verified it was close to lossless against z-ai’s official FP8 quantization on the GPQA-Diamond, tau2, and GSM8K benchmarks. The MXFP4 weights showed a slight regression on GSM8K (0.965 to 0.955) and GPQA-Diamond (0.9217 to 0.9026), but improved on tau2 macro (0.819 to 0.834). For the inference framework, Wafer chose sglang over vLLM and ATOM. vLLM had no working MXFP4 plus GlmMoeDsa path. ATOM’s output degraded at long context. Of the three, sglang offered the least friction on the way to native support.
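The size of those benchmark shifts is easier to judge as deltas. The sketch below recomputes them from the scores quoted above, reading each pair as (official FP8, Wafer MXFP4); that ordering is an interpretation of the text.

```python
# Scores quoted in the article, read as (official FP8, Wafer MXFP4).
SCORES = {
    "GSM8K": (0.965, 0.955),
    "GPQA-Diamond": (0.9217, 0.9026),
    "tau2 macro": (0.819, 0.834),
}


def deltas(fp8: float, mxfp4: float) -> tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = (mxfp4 - fp8) * 100
    relative = (mxfp4 - fp8) / fp8 * 100
    return absolute, relative


for name, (fp8, mxfp4) in SCORES.items():
    absolute, relative = deltas(fp8, mxfp4)
    print(f"{name:13s} {fp8:.4f} -> {mxfp4:.4f}  "
          f"{absolute:+.2f} pts  {relative:+.2f}%")
```

On these three benchmarks every change stays within about two points, in either direction.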

The real work came with speculative decode. The sglang ROCm image did not support multi-token prediction out of the box. Two bugs blocked it. First, the MTP head’s shared expert was stored in bf16, but sglang’s quantization lookup failed because of a module prefix mismatch, causing a shape mismatch crash at load. Wafer fixed it by copying the layer entries to the un-quantized list under the decoder name sglang actually uses. Second, deep speculative decode at draft depth 4 or more was blocked because a fused multi-step metadata kernel contained #include <cuda_runtime.h> with no ROCm guard. A single #ifdef USE_ROCM guard fixed it.
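The first bug is a general failure mode: a list of layers to leave un-quantized is keyed by one module name, while the runtime looks modules up under another. The Python sketch below illustrates that pattern and the copy-under-the-runtime-name fix. It is not sglang's code; every module name and helper in it is hypothetical.

```python
def is_unquantized(module_name: str, unquantized_prefixes: list[str]) -> bool:
    """True if the module should be loaded in bf16 rather than as MXFP4."""
    return any(module_name.startswith(p) for p in unquantized_prefixes)


def add_runtime_aliases(unquantized_prefixes: list[str],
                        checkpoint_prefix: str,
                        runtime_prefix: str) -> list[str]:
    """Copy matching entries under the prefix the runtime actually uses."""
    aliases = [
        runtime_prefix + p[len(checkpoint_prefix):]
        for p in unquantized_prefixes
        if p.startswith(checkpoint_prefix)
    ]
    return unquantized_prefixes + aliases


# Hypothetical names, for illustration only.
CHECKPOINT_PREFIX = "model.layers.78"   # name used in the quantization config
RUNTIME_PREFIX = "model.decoder"        # name the serving framework assigns
exclude = [f"{CHECKPOINT_PREFIX}.mlp.shared_experts"]
runtime_module = f"{RUNTIME_PREFIX}.mlp.shared_experts.down_proj"

# Before the fix the lookup misses, so bf16 weights are treated as quantized.
print(is_unquantized(runtime_module, exclude))

# After copying the entries under the runtime name, the lookup hits.
patched = add_runtime_aliases(exclude, CHECKPOINT_PREFIX, RUNTIME_PREFIX)
print(is_unquantized(runtime_module, patched))
```

The fix leaves the original entries in place and only adds aliases, which mirrors the description of copying entries rather than renaming them.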

Two trivial changes. But they unblocked close to a 3x gain in single stream throughput.

For aggregate throughput, the bottleneck was prefill, not decode. At tensor parallelism 8, the MI355X ran GLM5.2-MXFP4 at 1461 tok/s/node. Switching to TP4 with data parallelism 2 got to 1944 tok/s/node at 2.0 RPS. Still slow compared to Blackwell’s 3192 tok/s/node at 3.0 RPS. The reason: sglang’s fp4 MoE was silently running a slow FlyDSL heuristic fallback, because AMD’s Aiter library only shipped tuned configs for the a8w8 and fp8 paths. Wafer tuned the MoE kernel selection themselves on GLM’s fp4 shapes — model dimension 6144, MoE intermediate 2048, 256 experts, top-8. That got them to 2626 tok/s/node at 2.4 RPS.
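The progression is easier to follow as ratios. This short Python sketch uses only the throughput figures quoted above; the configuration labels are shorthand.

```python
BLACKWELL_TPS = 3192   # tok/s/node at 3.0 RPS, as quoted

# MI355X aggregate throughput in tok/s/node, in the order reported.
STEPS = [
    ("TP8", 1461),
    ("TP4 + DP2", 1944),
    ("after fp4 MoE tuning", 2626),
]


def fraction_of_blackwell(tps: int) -> float:
    """Throughput as a fraction of the quoted Blackwell figure."""
    return tps / BLACKWELL_TPS


def step_gains(steps: list[tuple[str, int]]) -> list[float]:
    """Speedup of each configuration over the one before it."""
    values = [tps for _, tps in steps]
    return [later / earlier for earlier, later in zip(values, values[1:])]


for label, tps in STEPS:
    print(f"{label:22s} {tps:5d} tok/s/node  "
          f"{fraction_of_blackwell(tps):.0%} of Blackwell")
for gain in step_gains(STEPS):
    print(f"step gain: x{gain:.2f}")
```

Each of the two steps is worth roughly a third more throughput, and the final figure works out to about 82% of the Blackwell number.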

The key detail: Wafer wrote no custom kernels this time. “SOTA on AMD is becoming more a matter of support, not software,” the post states. “The CUDA moat is eroding in real time.”
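The MoE tuning described earlier is a matter of kernel selection rather than kernel writing: a table of tuned configurations keyed by shape, with a fallback used when no entry matches. The Python sketch below shows that general pattern. It is not Aiter's or sglang's API; the table layout and kernel names are hypothetical, and only the GLM shapes come from the article.

```python
import logging
from typing import NamedTuple


class MoEShape(NamedTuple):
    dtype: str
    model_dim: int
    intermediate_dim: int
    num_experts: int
    top_k: int


# Hypothetical tuned-config table; kernel names are placeholders.
TUNED = {
    MoEShape("fp8", 6144, 2048, 256, 8): "tuned_fp8_kernel",
}
FALLBACK = "heuristic_fallback_kernel"


def select_kernel(shape: MoEShape, tuned: dict[MoEShape, str]) -> str:
    """Return the tuned kernel for this shape, or the fallback with a warning."""
    kernel = tuned.get(shape)
    if kernel is None:
        # Logging the miss makes the fallback visible instead of silent.
        logging.warning("no tuned config for %s; using %s", shape, FALLBACK)
        return FALLBACK
    return kernel


# Shapes quoted in the article: dim 6144, intermediate 2048, 256 experts, top-8.
glm_fp4 = MoEShape("fp4", 6144, 2048, 256, 8)
print(select_kernel(glm_fp4, TUNED))          # no fp4 entry: falls back

tuned_after = {**TUNED, glm_fp4: "tuned_fp4_kernel"}
print(select_kernel(glm_fp4, tuned_after))    # fp4 entry present: tuned kernel
```

The only difference between the two calls is one added table entry, which is why this kind of gap reads as missing support rather than missing software.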

This is the real story. The CUDA moat has always been described as a software advantage: better libraries, better compilers, better day-0 support. But what Wafer’s benchmark shows is that the moat is not a single deep channel. It is a collection of shallow ditches that can be filled one at a time by agent-driven engineering. The two bugs Wafer fixed — a header include guard and a module prefix mismatch — are the kind of friction that NVIDIA’s ecosystem has trained the industry to treat as insurmountable. They are not. They are just friction.

The implications for AI builders are concrete. If AMD’s MI355X can deliver 80% of Blackwell performance at less than half the cost, the economic calculus for inference shifts. A startup serving a 20k-token workload at scale could cut its inference bill by more than half by switching to AMD hardware and accepting a modest throughput penalty, provided the per-GPU price gap is wide enough: at 80% of the throughput, the gap has to exceed 2.5x for the cost per token to halve, a bar that the 2.75x figure quoted against the B300 clears. For workloads that are prefill-bound or cache-heavy, the tradeoff improves further. Wafer’s 60% cache hit rate assumption is generous, but not unrealistic for production deployments with repeated user queries.
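The saving per token depends on both the price gap and the throughput gap. Here is a minimal Python sketch of that arithmetic, using the throughput figures quoted in the article. The 2.75x price ratio is quoted against the B300 while the throughput comparison is against a B200, so that row is illustrative rather than like-for-like.

```python
def relative_cost_per_token(price_ratio: float, throughput_ratio: float) -> float:
    """Cost per token on the cheaper GPU relative to the baseline GPU.

    price_ratio: baseline price / cheaper GPU price (2.0 means half the price).
    throughput_ratio: cheaper GPU throughput / baseline throughput.
    """
    return (1.0 / price_ratio) / throughput_ratio


# MI355X vs Blackwell aggregate throughput in tok/s/node, from the article.
THROUGHPUT_RATIO = 2626 / 3192

for price_ratio in (2.0, 2.75):
    rel = relative_cost_per_token(price_ratio, THROUGHPUT_RATIO)
    print(f"{price_ratio:.2f}x cheaper per GPU -> {rel:.2f}x cost per token "
          f"({1 - rel:.0%} saving)")

# Price ratio at which the cost per token is exactly halved.
break_even = 2 / THROUGHPUT_RATIO
print(f"price ratio needed for a 50% saving: {break_even:.2f}x")
```

At exactly 2x cheaper per GPU the saving per token is closer to 40% than 50%; at 2.75x it passes one half.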

There are caveats. Wafer’s benchmark is single-node. Multi-node performance, where NVIDIA’s NVLink and InfiniBand advantages compound, is not measured. The tuning work, while light by AMD standards, still required days of engineering and compute. The sglang image needed patches that are not yet upstreamed. And the MXFP4 quantization, while close to lossless on the three evaluated benchmarks, may degrade on other tasks or at longer contexts. Wafer acknowledges that “SOTA on AMD rarely comes out of the box” and that “building and optimizing for the newest models can require weeks of engineering.”

But the trend is clear. Each time a team like Wafer publishes a benchmark, the gap shrinks. The next frontier model will have a few fewer ditches to fill. The one after that, fewer still. And as agents improve at kernel and model optimization, the manual tuning Wafer did this time will become automated.

The question is not whether AMD will catch up. It is whether NVIDIA’s software moat erodes quickly enough for buyers to exploit AMD’s price advantage. For now, the answer is that a team with a few weeks of engineering time can serve GLM5.2 at 2626 tok/s/node on AMD hardware at over 2x lower cost than Blackwell. That is a number that changes the math for anyone buying inference at scale.
