GLM-5.2 tops the open-weights leaderboard, but the cost story matters more

Key points

GLM-5.2 leads open-weights models with a score of 51 on the Artificial Analysis Intelligence Index, ahead of MiniMax-M3 and DeepSeek V4 Pro, and sits on the Pareto frontier of intelligence versus cost per task.

At roughly $0.46 per task, GLM-5.2 is the cheapest model at its intelligence level on the Pareto frontier, and it ships with an MIT license and a 1 million token context window.

GLM-5.2 uses 43,000 output tokens per task, 37,000 of them reasoning tokens, making it less token-efficient than peers such as MiniMax-M3 (24,000) and Kimi K2.6 (35,000).

Tessera Newsroom · 4 min read · June 18, 2026

Source: GLM-5.2 is the new leading open weights model on Artificial Analysis (artificialanalysis.ai)

Z AI’s GLM-5.2 is the new leading open-weights model on the Artificial Analysis Intelligence Index, scoring 51 on the v4.1 benchmark suite. That puts it ahead of MiniMax-M3 (44), DeepSeek V4 Pro (max, 44), and Kimi K2.6 (43). The model also places on the Pareto frontier of intelligence versus cost per task — a rare combination in the open-weights landscape.

The score is an 11-point jump over GLM-5.1, achieved with the same architecture: 744 billion total parameters, 40 billion active. The gains come from across-the-board improvements in scientific reasoning, agentic performance, and long-horizon task execution. CritPt scores rose 16 points to 21 percent. HLE climbed 12 points to 40 percent. On GDPval-AA v2, the benchmark for real-world agentic work, GLM-5.2 scored 1524 — ahead of MiniMax-M3 (1418) and DeepSeek V4 Pro (max, 1328), and effectively tied with GPT-5.5 (xhigh reasoning) at 1514.

This is not a surprise in isolation. Frontier labs have been trading the top spot on leaderboards for months. But GLM-5.2’s position on the cost curve is what makes the result worth attention. At roughly $0.46 per task on the Intelligence Index, it is the cheapest model at its intelligence level that sits on the Pareto frontier. That is a meaningful data point for builders choosing between open and proprietary models.
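For readers unfamiliar with the term, a model sits on the Pareto frontier when no other model is both at least as cheap per task and at least as capable. A minimal Python sketch of that test follows. The model names, costs, and scores are hypothetical placeholders chosen for illustration, not leaderboard data, and the `pareto_frontier` helper is an assumption of this sketch rather than Artificial Analysis's method.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float   # USD per task, lower is better
    score: float  # index score, higher is better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, higher score first.
    for p in sorted(points, key=lambda p: (p.cost, -p.score)):
        # Keep a point only if it beats every cheaper (or equal-cost) point.
        if p.score > best_score:
            frontier.append(p)
            best_score = p.score
    return frontier


# Hypothetical data for illustration only.
points = [
    Point("model_a", 0.10, 30),
    Point("model_b", 0.30, 28),
    Point("model_c", 0.50, 50),
    Point("model_d", 0.70, 45),
    Point("model_e", 1.20, 55),
]

frontier = pareto_frontier(points)
print([p.name for p in frontier])
```

In this toy set, model_b and model_d fall off the frontier because a cheaper model scores higher than each of them.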

The model also ships under an MIT license, with a 1 million token context window — up from 200,000 on GLM-5.1. It is available on Z AI’s first-party API and across third-party providers including DeepInfra, Novita, Nebius, Parasail, Siliconflow, GMI Cloud, Baseten, and Fireworks. Pricing is $1.40 per million input tokens, $4.40 per million output tokens, and $0.26 per million cache-hit tokens. For an open-weights model of this size, that is competitive.

But there is a tradeoff hiding in the numbers. GLM-5.2 uses 43,000 output tokens per Intelligence Index task, of which 37,000 are reasoning tokens. That is up from 26,000 on GLM-5.1, and above both MiniMax-M3 (24,000) and Kimi K2.6 (35,000). In tokens per task, the model is less efficient than these peers, though they also score lower on the index (44 and 43). On the Intelligence versus Output Tokens chart, it falls outside the most attractive quadrant.
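As a rough back-of-the-envelope check, the output side of that token budget can be priced at Z AI's listed $4.40 per million output tokens. The sketch below assumes the first-party list price applies to every output token, reasoning included, and it ignores input and cached tokens. It is therefore a partial figure, not a reconstruction of the $0.46 per-task number.

```python
OUTPUT_PRICE_PER_MILLION = 4.40  # USD, first-party price quoted in the article


def output_cost(tokens: int, price_per_million: float = OUTPUT_PRICE_PER_MILLION) -> float:
    """USD cost of generating `tokens` output tokens at the given price."""
    return tokens * price_per_million / 1_000_000


total_output_cost = output_cost(43_000)   # all output tokens per task
reasoning_cost = output_cost(37_000)      # reasoning tokens only
reasoning_share = 37_000 / 43_000         # fraction of output spent reasoning

print(f"output cost per task: ${total_output_cost:.4f}")
print(f"reasoning portion:    ${reasoning_cost:.4f}")
print(f"reasoning share:      {reasoning_share:.0%}")
```

Under these assumptions the output tokens come to roughly $0.19 per task, with about 86 percent of that spent on reasoning.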

That matters for two reasons.

First, token efficiency is a proxy for inference cost at scale. Cost per query is the product of tokens used and price per token, so a model that uses 43,000 tokens per task can cost more to run than one that uses 24,000, even when its per-token price is lower. The $0.46 per task figure already accounts for this, but it is a static number on a static benchmark. In production, where task complexity varies and caching patterns shift, the gap could widen. Builders who optimize for cost at high volume may prefer a less intelligent model that uses fewer tokens.
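The interaction between token count and per-token price can be made concrete with a break-even calculation. The sketch below uses the article's output-token counts and Z AI's $4.40 output price. The peer model priced at break-even is a hypothetical for illustration, since the article does not give competitors' prices, and input tokens are ignored.

```python
def cost_per_task(output_tokens: int, price_per_million: float) -> float:
    """USD output cost for one task."""
    return output_tokens * price_per_million / 1_000_000


def break_even_price(tokens_a: int, price_a: float, tokens_b: int) -> float:
    """Per-million price at which model B's output cost per task equals model A's."""
    return tokens_a * price_a / tokens_b


glm_cost = cost_per_task(43_000, 4.40)
peer_break_even = break_even_price(43_000, 4.40, 24_000)

print(f"GLM-5.2 output cost per task: ${glm_cost:.4f}")
print(f"A 24,000-token model breaks even at ${peer_break_even:.2f} per million output tokens")
```

On these numbers, a model spending 24,000 output tokens per task could charge up to about $7.88 per million output tokens, roughly 1.8 times GLM-5.2's rate, before its output cost per task exceeded GLM-5.2's.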

Second, the token inefficiency signals something about the model’s internal reasoning strategy. GLM-5.2 is generating more chain-of-thought tokens to reach its answers. That is not inherently bad, since some problems require deeper reasoning. But with the architecture unchanged from GLM-5.1, it suggests that the model’s intelligence gains are partly a function of spending more compute per query, not just better training or data. If that trend continues, the open-weights race could shift from a competition over benchmark scores to a competition over reasoning efficiency.

The GDPval-AA v2 result reinforces this reading. The benchmark uses a rotating panel of frontier-model judges and a 250-turn limit for agent trajectories, up from 100. GLM-5.2’s strong performance here, effectively tying GPT-5.5, suggests that its reasoning depth pays off in long-horizon agentic tasks. But the high token count per task is also a liability for agentic workloads, where each turn adds to the total cost. For real-world agent deployments, a model that reaches comparable results with fewer tokens per turn may be more practical.

Z AI has not published a technical report or a paper for GLM-5.2. The only details available are what Artificial Analysis provides: the model size, the benchmark scores, the pricing, and the license. That is a departure from the practice of labs like DeepSeek and MiniMax, which have released detailed technical reports alongside their models. It makes it harder to assess whether the gains come from data quality, training methodology, or simply scaling inference-time compute.

The open-weights leaderboard is increasingly crowded. GLM-5.2 now leads, but MiniMax-M3 and DeepSeek V4 Pro are close behind. Kimi K2.6 is within striking distance. None of these models are likely to hold the top spot for long. The real question is whether any of them can combine top-tier intelligence with token efficiency that makes them practical for production use at scale.

For builders, the takeaway is not that GLM-5.2 is the best open-weights model. It is that the gap between open and proprietary models is narrowing on intelligence, but widening on efficiency. A model that ties GPT-5.5 on agentic benchmarks while using 43,000 tokens per task is not a drop-in replacement for a proprietary model that achieves the same result with far fewer tokens. The cost structure of deployment will determine which model wins in practice.

GLM-5.2 is a strong entry. It is not a revolution. It is a measured step forward in a race where the next step is already being taken.
