
The Inference Hardware Bazaar: NVIDIA Still Leads, but the Alternatives Are Real

Key point 1: NVIDIA holds roughly 92% of the discrete GPU market, but specialized inference hardware from startups such as FuriosaAI and Positron is winning real deployments on power efficiency and TCO.

Key point 2: Hyperscalers and startups are fragmenting the market. Broadcom is co-designing chips for OpenAI, Meta acquired Rivos, Google made its Ironwood TPU generally available, and NVIDIA acquired Groq's LPU technology and engineering team for about $20B.

Key point 3: Energy is a key driver. AI could push U.S. data centers to 6.7–12% of national power consumption by 2028, which turns low-power designs like FuriosaAI's 3 kW server into a competitive moat for enterprise inference.


Tessera Newsroom · 5 min read · June 18, 2026

Source: LLM Inference Hardware: An Enterprise Guide to Key Players (intuitionlabs.ai)


The enterprise inference hardware market is no longer a one-vendor story. A comprehensive guide published by IntuitionLabs in February 2026 makes the case that while NVIDIA still commanded roughly 92% of the discrete GPU market as of the first half of 2025, competition is intensifying faster than at any point since the AI boom began. The guide catalogs dozens of vendors, from hyperscaler custom silicon to inference-optimized startups, and the picture it paints is of a market fragmenting along lines of power efficiency, memory architecture, and total cost of ownership.

NVIDIA remains the default. Its Blackwell generation (B200) and the follow-on Blackwell Ultra (B300) are shipping in volume, and the Vera Rubin platform, announced at CES 2026, is already in production. The Rubin GPU packs 336 billion transistors and 288 GB of HBM4, and it delivers 50 PFLOPS of NVFP4 inference per GPU. A Vera Rubin NVL72 rack promises 3.6 EFLOPS of dense FP4 inference. These are staggering numbers, and they come with the CUDA ecosystem that keeps enterprises locked in. Dell is reportedly close to a $5 billion deal to supply NVIDIA GPU-powered servers to xAI, and HPE secured a $1 billion contract with X. The incumbency is real.
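The per-GPU and per-rack figures quoted above are consistent with each other, as a quick back-of-envelope check suggests. The sketch below assumes that "NVL72" denotes 72 Rubin GPUs per rack and that peak throughput adds up linearly across GPUs; real workloads would see less because of interconnect and memory limits. The function name `rack_eflops` is illustrative, not from the guide.

```python
# Back-of-envelope check of the rack-level figure quoted in the article.
# Assumption: "NVL72" means 72 Rubin GPUs per rack, with peak throughput
# summed linearly (no interconnect or memory bottlenecks modeled).
GPUS_PER_RACK = 72       # assumed from the NVL72 name
PFLOPS_PER_GPU = 50.0    # NVFP4 inference per GPU, as quoted in the article


def rack_eflops(gpus: int, pflops_per_gpu: float) -> float:
    """Aggregate peak throughput in EFLOPS (1 EFLOPS = 1,000 PFLOPS)."""
    return gpus * pflops_per_gpu / 1000.0


print(f"{rack_eflops(GPUS_PER_RACK, PFLOPS_PER_GPU):.1f} EFLOPS per rack")
```

Under those assumptions the product lands on the 3.6 EFLOPS rack figure.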

But the guide documents a pattern that matters more than any single chip spec: specialized inference hardware is winning real deployments on the metrics operators care about, not just benchmark chasers. FuriosaAI, a South Korean startup, offers an LLM-scale inference server powered by its RNGD chips that consumes roughly 3 kW, compared with approximately 10 kW for a comparable NVIDIA DGX system. The company claims five such systems can fit in a single rack versus one DGX. LG is already using FuriosaAI chips in production. Positron’s Atlas accelerator claims 280 tokens per second on Llama 3.1 8B at less than half the power of a DGX node. These are not paper numbers from a press release. They are shipping products with named customers.
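To see what the 3 kW versus 10 kW gap could mean on a power bill, here is a minimal sketch. The two wattage figures come from the article; the electricity price, the 100% utilization default, and the function names are hypothetical placeholders, and cooling overhead (PUE) is ignored.

```python
# Rough annual energy comparison for the two systems named in the article.
# Assumptions (not from the guide): constant draw at the quoted wattage,
# a placeholder electricity price, and no cooling overhead.
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, ignoring leap years

FURIOSA_SERVER_KW = 3.0     # quoted in the article
DGX_SYSTEM_KW = 10.0        # quoted in the article
PRICE_PER_KWH_USD = 0.10    # hypothetical price, substitute your own rate


def annual_energy_kwh(power_kw: float, utilization: float = 1.0) -> float:
    """Energy used in one year at the given average utilization (0..1)."""
    return power_kw * HOURS_PER_YEAR * utilization


def annual_energy_cost(power_kw: float, price_per_kwh: float,
                       utilization: float = 1.0) -> float:
    """Annual electricity cost in the currency of price_per_kwh."""
    return annual_energy_kwh(power_kw, utilization) * price_per_kwh


for label, kw in (("FuriosaAI server", FURIOSA_SERVER_KW),
                  ("DGX system", DGX_SYSTEM_KW)):
    kwh = annual_energy_kwh(kw)
    cost = annual_energy_cost(kw, PRICE_PER_KWH_USD)
    print(f"{label}: {kwh:,.0f} kWh/year, ${cost:,.0f}/year")
```

The ratio between the two results is fixed by the wattages alone, so changing the assumed price scales both lines equally. Whether the systems deliver comparable throughput is a separate question this sketch does not address.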

The guide also captures the tectonic shifts in who builds the silicon. Hyperscalers are moving aggressively into custom chips. Broadcom is co-designing chips for OpenAI. Meta completed its acquisition of RISC-V startup Rivos for roughly $2 billion. Microsoft launched its Maia 200 inference accelerator. Google made its Ironwood TPU generally available. And in the most dramatic consolidation move of the past year, NVIDIA acquired Groq’s LPU technology and engineering team for approximately $20 billion in December 2025. That acquisition was a defensive acknowledgment that the LPU architecture, built for deterministic low-latency inference, posed a genuine threat in a specific slice of the market.

The energy angle is the underappreciated driver. The guide cites a DOE-backed report warning that AI could drive U.S. data centers to consume 6.7 to 12 percent of national power by 2028. AI workloads have already doubled data center energy use since 2017. For enterprises running inference at scale, power is no longer an operational detail. It is a budget line item that determines whether a deployment pencils out. That is why FuriosaAI’s 3 kW claim and d-Matrix’s 3D memory design (Pavehawk), which co-locates compute and memory to reduce data movement, are not niche engineering curiosities. They are competitive moats.

The guide’s most useful contribution is its catalog of system integrators. Dell, HPE, Lenovo, Super Micro, and IBM are all packaging these chips into turnkey AI servers. The enterprise buyer does not need to choose between a raw GPU and a startup’s PCIe card. They can buy a PowerEdge server with Blackwell Ultra GPUs or a ProLiant with AMD’s MI350X. The integrators are becoming the gatekeepers of hardware diversity, and their willingness to support multiple vendors is what gives the challengers a path to revenue.

What the guide does not fully address is the software gap. NVIDIA’s CUDA and TensorRT are mature, documented, and supported by every major framework. The challengers each have their own compiler stacks and SDKs. Tenstorrent has TT-BUDA. SambaNova has its Dataflow architecture and compiler. Cerebras has its own software stack for the Wafer-Scale Engine and CS-3 system. For an enterprise team that just wants to deploy a model without hiring a compiler engineer, that fragmentation is a real cost. The guide notes that AMD acquired the entire engineering team from Untether AI in June 2025 to bolster its AI compiler and SoC capabilities. That acquisition signals that the software stack, not the hardware, is the next battleground.

The market numbers in the guide put the opportunity in context. The global AI inference market is projected to reach roughly $255 billion by 2030. NVIDIA’s data center revenue alone hit $193.7 billion in its most recent fiscal year. The challengers are not going to displace NVIDIA at the high end of training clusters. But inference is a different game. It is about throughput per watt, latency per request, and cost per token at scale. Those metrics favor specialized architectures in a way that training benchmarks do not.
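Two of the metrics named above, throughput per watt and cost per token, reduce to simple ratios. The sketch below shows one way an evaluation team might compute them. The `InferenceNode` class and every number in it are hypothetical placeholders, not vendor data or figures from the guide, and latency per request is not modeled.

```python
from dataclasses import dataclass


@dataclass
class InferenceNode:
    """One inference system; every field is a hypothetical input."""
    name: str
    tokens_per_second: float   # sustained throughput
    power_watts: float         # average draw under load
    hourly_cost_usd: float     # amortized hardware plus energy

    def tokens_per_joule(self) -> float:
        # 1 watt = 1 joule per second, so (tokens/s) / (J/s) = tokens per joule.
        return self.tokens_per_second / self.power_watts

    def cost_per_million_tokens(self) -> float:
        tokens_per_hour = self.tokens_per_second * 3600
        return self.hourly_cost_usd / tokens_per_hour * 1_000_000


# Placeholder values chosen only to exercise the formulas.
gpu_node = InferenceNode("gpu-node", 1000.0, 10_000.0, 4.00)
asic_node = InferenceNode("asic-node", 600.0, 3_000.0, 1.50)

for node in (gpu_node, asic_node):
    print(f"{node.name}: {node.tokens_per_joule():.2f} tokens/J, "
          f"${node.cost_per_million_tokens():.2f} per million tokens")
```

With these made-up inputs, the node with lower raw throughput still comes out ahead on both ratios, which is the kind of outcome the paragraph above describes. Real comparisons would need measured throughput at a fixed latency target and model size.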

The most telling data point in the guide is the OpenAI-Cerebras partnership announced in January 2026, which provides 750 MW of compute. That is not a pilot program. That is a hyperscaler-level commitment to an alternative architecture. If OpenAI, which has the deepest pockets and the most demanding inference workloads in the industry, is willing to bet on a non-NVIDIA wafer-scale system, the rest of the enterprise market has permission to evaluate alternatives.

The inference hardware market is becoming a bazaar, not a cathedral. NVIDIA still owns the cathedral. But the bazaar is where the enterprise buyer will find the power-efficient, cost-effective machine that fits their actual workload. The guide makes that case with enough named customers, shipped products, and wattage figures to take it seriously.
