Z AI’s GLM-5.2 is the new leading open-weights model on the Artificial Analysis Intelligence Index, scoring 51 on the v4.1 benchmark suite. That puts it ahead of MiniMax-M3 (44), DeepSeek V4 Pro (max, 44), and Kimi K2.6 (43). The model also places on the Pareto frontier of intelligence versus cost per task — a rare combination in the open-weights landscape.
The score is an 11-point jump over GLM-5.1, achieved with the same architecture: 744 billion total parameters, 40 billion active. The gains come from across-the-board improvements in scientific reasoning, agentic performance, and long-horizon task execution. CritPt scores rose 16 points to 21 percent. HLE climbed 12 points to 40 percent. On GDPval-AA v2, the benchmark for real-world agentic work, GLM-5.2 scored 1524 — ahead of MiniMax-M3 (1418) and DeepSeek V4 Pro (max, 1328), and effectively tied with GPT-5.5 (xhigh reasoning) at 1514.
This is not a surprise in isolation. Frontier labs have been trading the top spot on leaderboards for months. But GLM-5.2’s position on the cost curve is what makes the result worth attention. At roughly $0.46 per task on the Intelligence Index, it sits on the Pareto frontier: no model in the index matches its score at a lower cost per task. That is a meaningful data point for builders choosing between open and proprietary models.
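For readers unfamiliar with the term, a minimal sketch of how a Pareto frontier over cost and score can be computed follows. Only the `model_b` point mirrors figures reported here ($0.46 per task, score 51); every other name and number is a made-up placeholder, and this is not Artificial Analysis's own method.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float   # USD per task, lower is better
    score: float  # index score, higher is better


def pareto_frontier(points):
    """Return the points that no other point beats on both cost and score."""
    frontier = []
    # Walk from cheapest to most expensive; on equal cost, best score first.
    for p in sorted(points, key=lambda p: (p.cost, -p.score)):
        # Keep a point only if it scores strictly higher than every cheaper one.
        if not frontier or p.score > frontier[-1].score:
            frontier.append(p)
    return frontier


# Hypothetical data: only model_b uses numbers from the article.
points = [
    Point("model_a", 0.10, 30),
    Point("model_b", 0.46, 51),
    Point("model_c", 0.60, 44),
    Point("model_d", 1.20, 57),
    Point("model_e", 0.30, 28),
]

frontier_names = [p.name for p in pareto_frontier(points)]
print(frontier_names)  # ['model_a', 'model_b', 'model_d']
```

In this toy data, `model_c` falls off the frontier because `model_b` is both cheaper and higher scoring, and `model_e` falls off because `model_a` is.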
The model also ships under an MIT license, with a 1 million token context window — up from 200,000 on GLM-5.1. It is available on Z AI’s first-party API and across third-party providers including DeepInfra, Novita, Nebius, Parasail, Siliconflow, GMI Cloud, Baseten, and Fireworks. Pricing is $1.40 per million input tokens, $4.40 per million output tokens, and $0.26 per million cache-hit tokens. For an open-weights model of this size, that is competitive.
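As a rough illustration of how those list prices turn into a per-request bill, here is a small sketch. It assumes cache-hit tokens are billed at the cache rate instead of the input rate and that `input_tokens` counts only uncached input; the function name and the input and cache-hit token counts in the example are hypothetical, and actual billing rules may differ by provider.

```python
# List prices quoted above, in USD per million tokens.
PRICE_PER_MILLION = {"input": 1.40, "output": 4.40, "cache_hit": 0.26}


def request_cost(input_tokens, output_tokens, cache_hit_tokens=0):
    """Estimate the USD cost of one request (assumed billing model)."""
    return (
        input_tokens * PRICE_PER_MILLION["input"]
        + output_tokens * PRICE_PER_MILLION["output"]
        + cache_hit_tokens * PRICE_PER_MILLION["cache_hit"]
    ) / 1_000_000


# Hypothetical request: 10,000 fresh input tokens, 50,000 cached tokens,
# and 43,000 output tokens.
example_cost = request_cost(10_000, 43_000, cache_hit_tokens=50_000)
print(f"${example_cost:.4f}")  # $0.2162
```

Even in this example, output tokens account for most of the bill, which is why per-task token usage matters as much as the headline price.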
But there is a tradeoff hiding in the numbers. GLM-5.2 uses 43,000 output tokens per Intelligence Index task, of which 37,000 (about 86 percent) are reasoning tokens. That is up from 26,000 on GLM-5.1, and above both MiniMax-M3 (24,000) and Kimi K2.6 (35,000). The model spends more tokens per task than its closest open-weights rivals. On the Intelligence versus Output Tokens chart, it sits outside the most attractive quadrant, where high scores pair with low token use.
That matters for two reasons.
First, token efficiency is a proxy for inference cost at scale. A model that uses 43,000 tokens per task can cost more per query than one that uses 24,000, even at a lower per-token price: the discount has to exceed roughly 44 percent before the heavier model comes out ahead on output tokens. The $0.46 per task figure already accounts for this, but it is a static number on a static benchmark. In production, where task complexity varies and caching patterns shift, the gap could widen. Builders who optimize for cost at high volume may prefer a less intelligent model that uses fewer tokens.
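The arithmetic behind that comparison can be sketched as follows. This is an output-tokens-only simplification that ignores input and cached tokens, so it does not reproduce the $0.46 per task figure; the function names are illustrative.

```python
def output_cost(tokens, price_per_million):
    """USD spent on output tokens for one task."""
    return tokens * price_per_million / 1_000_000


def break_even_price(tokens_a, price_a, tokens_b):
    """Output price for model B at which it costs the same per task as model A."""
    return price_a * tokens_a / tokens_b


# GLM-5.2 figures from the article: 43,000 output tokens at $4.40 per million.
glm_cost = output_cost(43_000, 4.40)

# A model using 24,000 tokens per task stays cheaper on output
# as long as its output price is below this threshold.
threshold = break_even_price(43_000, 4.40, 24_000)

# Equivalently, the 43,000-token model needs a per-token price below
# this fraction of the lighter model's price to win on output cost.
ratio = 24_000 / 43_000

print(f"GLM-5.2 output cost per task: ${glm_cost:.4f}")        # $0.1892
print(f"Break-even output price: ${threshold:.2f} per million")  # $7.88
print(f"Required price ratio: {ratio:.3f}")                     # 0.558
```

Put differently, a 24,000-token model could charge up to about $7.88 per million output tokens and still undercut GLM-5.2 on output spend per task.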
Second, the token inefficiency signals something about the model’s reasoning strategy. GLM-5.2 is generating more chain-of-thought tokens to reach its answers. That is not inherently bad, since some problems require deeper reasoning. But with the architecture unchanged from GLM-5.1, it suggests that the intelligence gains are partly a function of spending more compute per query, not just better training data or methodology. If that trend continues, the open-weights race could shift from a competition over benchmark scores to a competition over reasoning efficiency.
The GDPval-AA v2 result reinforces this reading. The benchmark uses a rotating panel of frontier-model judges and a 250-turn limit for agent trajectories, up from 100. GLM-5.2’s strong performance here — tying GPT-5.5 — suggests that its reasoning depth pays off in long-horizon agentic tasks. But the token count per task is also a liability for agentic workloads, where each turn adds to the total cost. A model that reasons efficiently in short tasks may be more practical for real-world agent deployments.
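To see why per-turn token use compounds in agentic settings, consider this rough sketch. It assumes every turn re-reads the full prior context at the cache-hit price and that all output tokens stay in context; the per-turn token counts are hypothetical, and real agent harnesses may trim context or drop reasoning tokens between turns.

```python
def trajectory_cost(turns, new_input_per_turn, output_per_turn,
                    input_price=1.40, output_price=4.40, cache_price=0.26):
    """Total USD cost of an agent trajectory; prices are USD per million tokens."""
    context = 0   # tokens from earlier turns, re-read from cache each turn
    total = 0.0
    for _ in range(turns):
        total += (context * cache_price
                  + new_input_per_turn * input_price
                  + output_per_turn * output_price) / 1_000_000
        # Assumption: everything produced this turn stays in context.
        context += new_input_per_turn + output_per_turn
    return total


# Hypothetical workload: 1,000 new input and 1,000 output tokens per turn.
cost_100 = trajectory_cost(100, 1_000, 1_000)
cost_250 = trajectory_cost(250, 1_000, 1_000)
print(f"100 turns: ${cost_100:.2f}")
print(f"250 turns: ${cost_250:.2f}")
```

Under these assumptions the cost grows faster than the turn count, because each turn pays again for the accumulated context, so a model that emits more tokens per turn is penalized twice.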
Z AI has not published a technical report or a paper for GLM-5.2. The only details available are what Artificial Analysis provides: the model size, the benchmark scores, the pricing, and the license. That is a departure from the practice of labs like DeepSeek and MiniMax, which have released detailed technical reports alongside their models. It makes it harder to assess whether the gains come from data quality, training methodology, or simply scaling inference-time compute.
The open-weights leaderboard is increasingly crowded. GLM-5.2 now leads, but MiniMax-M3 and DeepSeek V4 Pro are close behind. Kimi K2.6 is within striking distance. None of these models are likely to hold the top spot for long. The real question is whether any of them can combine top-tier intelligence with token efficiency that makes them practical for production use at scale.
For builders, the takeaway is not that GLM-5.2 is the best open-weights model. It is that the gap between open and proprietary models is narrowing on intelligence, but widening on efficiency. A model that ties GPT-5.5 on agentic benchmarks while using 43,000 output tokens per Intelligence Index task is not a drop-in replacement for any proprietary model that reaches the same result with, say, half the tokens. The cost structure of deployment will determine which model wins in practice.
GLM-5.2 is a strong entry. It is not a revolution. It is a measured step forward in a race where the next step is already being taken.