GLM-5.2 Beats GPT-5.5: China Open-Weight Model Claims SWE-bench Lead at 1/6th Cost
TL;DR
Z.ai releases GLM-5.2, a 753B open-weight model scoring 62.1 on SWE-bench Pro, beating GPT-5.5 at $4.40 per million output tokens, roughly one-sixth the cost. MIT licensed, Anthropic-compatible API, and timed perfectly as Fable 5 remains offline.
Z.ai (formerly Zhipu AI) went public with the full weights and API for GLM-5.2 on June 17. The 753B-parameter open-weight model scores 62.1 on SWE-bench Pro, edging past GPT-5.5’s 58.6. The pricing spread is the bigger story: $4.40 per million output tokens versus GPT-5.5’s $30, roughly a sixfold difference. MIT licensing means any enterprise can download, fine-tune, and deploy commercially without signing a vendor agreement.
Why the Timing Matters
GLM-5.2 didn’t arrive in a vacuum. Anthropic’s Fable 5 and Mythos 5 have been offline since June 12 under a US Department of Commerce emergency directive, now entering day 10. The ban traces back to SK Telecom’s $100M Anthropic investment and an Amazon research team’s vulnerability disclosure.
Into that supply gap, GLM-5.2 lands with an Anthropic-compatible API endpoint. Developers currently using Claude Code or Cursor can theoretically swap a base URL and keep working. Weights are available on Hugging Face (zai-org/GLM-5.2) for teams with on-premise GPU capacity.
VentureBeat noted this marks the first time a Chinese open-weight model has taken a confirmed lead on long-horizon coding benchmarks. Six months ago that sentence wouldn’t have been written seriously.
What the Numbers Actually Say
The headline scores deserve a closer read.
SWE-bench Pro 62.1 vs 58.6 is a 3.5-point gap, about 6% relative improvement. FrontierSWE 74.4% vs 72.6% is a smaller margin, and Claude Opus 4.8 still sits at 75.1% on that same benchmark. On Terminal-Bench 2.1, GPT-5.5 actually wins: 84.0 vs GLM-5.2’s 81.0. This is a category-specific lead on long-horizon coding tasks, not a general sweep.
All benchmark numbers come from Z.ai’s own reporting. No independent third-party verification exists yet. The standard caveat applies: treat self-reported numbers as marketing until replicated.
The architecture has one genuinely interesting innovation: IndexShare. The mechanism reuses sparse attention indexers across transformer layers, cutting floating-point operations by roughly 2.9x at 1M-token context length. With 753B total parameters but only ~40B active per inference (MoE), the cost advantage has a concrete engineering explanation. It’s not scale magic.
A quick cost estimate on real workloads. A typical SWE-bench-style task burns 80K to 120K output tokens. At GPT-5.5 pricing that’s $2.40 to $3.60 per task. At GLM-5.2 API pricing, $0.35 to $0.53. At 1,000 agentic coding tasks per day, the monthly delta is roughly $57K to $93K. For teams running large-scale CI/CD agentic pipelines, that’s not noise.
Self-hosting carries a higher bar. Z.ai recommends a minimum of eight H100 GPUs. Cloud spot pricing runs $25 to $35/hour, roughly $220K annually just in compute before engineering overhead. The MIT license gives smaller teams the theoretical right to run it; the hardware cost is what actually gates access.
Signals Worth Watching
Three concrete data points will determine how this plays out.
First, when Fable 5 comes back. Anthropic’s Chris Ciauri said “coming days.” If the export restriction lifts before June 30, GLM-5.2’s substitution window is roughly two weeks. An extension into July would force enterprise procurement teams to make longer-term decisions about API diversification.
Second, whether the OpenRouter Fusion DRACO numbers hold under independent testing. Reports claim that a Gemini + Kimi + DeepSeek combination reaches 64.7% DRACO scores, approaching Fable 5 performance. If that replicates, the “single best model” moat is being eroded by multi-model synthesis. That’s structurally bad news for every closed-source lab.
Third, GLM-5.2 download velocity on Hugging Face through August. DeepSeek-V3 hit 1M downloads in its first week. GLM-5.2 reaching that scale would validate Z.ai’s pricing strategy in concrete adoption numbers, not just benchmark charts.
Have you built anything on an Anthropic-compatible endpoint that wasn’t actually Anthropic? What broke, if anything?
If this was useful, subscribe to the newsletter for weekly AI PM insights and GenAI case studies.
Related Reading
Related Articles
US AI Models Crash Below 30% on OpenRouter: China Dominates Developer Traffic as OpenAI Faces IPO Pricing Dilemma
US AI models on OpenRouter fell from 70% to 30% token share in a year. DeepSeek alone holds 16.3%. ChatGPT global share dropped below 50% for the first time. OpenAI weighs deep price cuts heading into IPO. The compliance moat vs. the cost floor: which holds longer?
GPT-5.6 Sol Launches Under Government Lock: Washington's New Frontier AI Gate
OpenAI's GPT-5.6 Sol launched June 26, restricted to ~20 government-vetted partners only. Sol Ultra scores 91.9% on Terminal-Bench 2.1, but the governance framework matters more than the benchmark.