Technology Radar
Assess

Full deep dive: GLM-5 Architecture Breakdown

GLM-5 is Zhipu AI's (Z.ai) 744-billion-parameter open-weight model. It performs within single-digit percentage points of GPT-5.2 and Claude Opus 4.5 on major benchmarks, and was trained entirely on 100,000 Huawei Ascend chips with zero NVIDIA hardware. Released under the MIT license, it represents both the democratization of frontier AI and China's hardware independence from US semiconductor supply chains.

Architecture

GLM-5 is a decoder-only Transformer with a Mixture-of-Experts (MoE) architecture:

Total parameters: 744B
Active per token: ~40B (top-8 of 256 experts)
Sparsity: ~5.9%
Attention: DeepSeek Sparse Attention (DSA)
Positional encoding: RoPE
Activations: SwiGLU
Normalization: Post-LN
Context window: 205K input, 128K output
Pre-training data: 28.5T tokens

Sparse MoE: The Core Trick

The defining architectural choice is sparse MoE. GLM-5 contains 744B total parameters but only activates 40B for any given input, routing each token through 8 of its 256 specialized expert sub-networks. This delivers frontier-class performance at a fraction of the compute cost of a dense model of equivalent capability.

Think of it as a hospital with 256 specialists on staff. For any patient (token), only 8 relevant specialists consult — the cardiologist doesn't weigh in on a broken arm. The full hospital has massive collective expertise, but each patient gets efficient, focused care.
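The routing step can be sketched in a few lines of NumPy. This is a toy illustration of top-k gating, not GLM-5's actual gating network: the dimensions, the random gate weights, and the function name are all assumptions for the sake of the example.

```python
import numpy as np

def topk_moe_route(token, gate_w, k=8):
    """Route one token vector to the top-k experts by gate score.

    gate_w: (d_model, n_experts) gating weights (random here, for illustration).
    Returns the chosen expert indices and their renormalized softmax weights.
    """
    logits = token @ gate_w                       # (n_experts,) gate scores
    topk = np.argsort(logits)[-k:]                # indices of the k largest scores
    w = np.exp(logits[topk] - logits[topk].max()) # softmax over the selected experts only
    return topk, w / w.sum()

rng = np.random.default_rng(0)
d_model, n_experts = 64, 256
token = rng.standard_normal(d_model)
gate_w = rng.standard_normal((d_model, n_experts))

experts, weights = topk_moe_route(token, gate_w)
print(len(experts), round(weights.sum(), 6))      # → 8 1.0
```

Only the selected experts' feed-forward networks actually run for this token, which is where the 744B-total / 40B-active economics come from.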

DeepSeek Sparse Attention (DSA)

Integration of DSA significantly reduces deployment costs while maintaining long-context capabilities. This is a crucial architectural upgrade from GLM-4.5, keeping performance high while lowering inference costs for the 205K token context window.
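As rough intuition for the sparsity idea, here is a toy sketch in which each query attends only to its top-k keys. This is a simplified illustration, not the actual DSA algorithm; note the keys are still all scored here, whereas real sparse-attention schemes avoid even that cost with lightweight selection mechanisms.

```python
import numpy as np

def sparse_attention(q, K, V, k=4):
    """Toy 'sparse attention': one query attends only to its top-k keys
    by dot-product score, instead of the full key set."""
    scores = K @ q                                # (n_keys,) similarity scores
    idx = np.argsort(scores)[-k:]                 # keep only the k strongest keys
    w = np.exp(scores[idx] - scores[idx].max())   # softmax over the kept keys
    w /= w.sum()
    return w @ V[idx]                             # weighted mixture of their values

rng = np.random.default_rng(1)
n_keys, d = 1024, 32
q = rng.standard_normal(d)
K = rng.standard_normal((n_keys, d))
V = rng.standard_normal((n_keys, d))

out = sparse_attention(q, K, V)
print(out.shape)                                  # → (32,)
```

The output has the same dimensionality as full attention, but the value mixture touches 4 of 1,024 keys, which is the kind of saving that matters over a 205K-token context.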

The Slime RL Framework: How Post-Training Works

The most technically interesting piece of GLM-5 isn't the base model — it's Slime, the asynchronous reinforcement learning framework used for post-training.

The Problem Slime Solves

Traditional RL training for LLMs is sequential: generate a batch of rollouts → wait for reward evaluation → update parameters → repeat. This pipeline is badly bottlenecked: the GPU cluster sits idle while evaluation runs.

Slime's Architecture

Slime decouples the pipeline into three independent asynchronous modules:

  1. Training (Megatron) — Reads data from the buffer, runs gradient updates, syncs parameters to rollout
  2. Rollout (SGLang + router) — Generates new data including reward/verifier outputs, stores in buffer
  3. Data Buffer — Bridge module managing prompt initialization, custom data, and rollout generation

These run in parallel. Inference, evaluation, and parameter updates happen simultaneously instead of sequentially.
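The decoupling can be sketched with a thread-safe queue standing in for the data buffer. Everything here is a toy: the real system runs Megatron and SGLang across clusters, while this just shows a rollout producer and a trainer consumer operating concurrently instead of in lockstep.

```python
import queue
import threading

def rollout_worker(buffer, n):
    """Rollout engine stand-in: pushes (trajectory, reward) pairs into the
    shared data buffer without waiting for the trainer to consume them."""
    for step in range(n):
        buffer.put((f"trajectory-{step}", step % 3))  # fake generation + verifier reward
    buffer.put(None)                                  # sentinel: no more data

def trainer(buffer, updates):
    """Training loop stand-in: consumes whatever the buffer holds and
    'updates' parameters, concurrently with ongoing rollout."""
    while (item := buffer.get()) is not None:
        updates.append(item)

buffer = queue.Queue(maxsize=8)   # the bridge module between the two sides
updates = []
t1 = threading.Thread(target=rollout_worker, args=(buffer, 20))
t2 = threading.Thread(target=trainer, args=(buffer, updates))
t1.start(); t2.start(); t1.join(); t2.join()
print(len(updates))               # → 20
```

The bounded queue also illustrates the buffer's flow-control role: rollout blocks only when training falls far behind, rather than after every batch.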

APRIL: Active Partial Rollouts

Slime's key innovation is Active Partial Rollouts (APRIL), which permits evaluation of incomplete trajectories: partial results feed back into training without waiting for every trajectory to finish. This is critical for agentic tasks, where a single trajectory might involve hours of tool use.
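The core idea can be sketched as a generator that emits growing prefixes of a long trajectory, so training can consume them before the trajectory completes. A simplified illustration only; the actual APRIL mechanism in Slime is more involved, and the chunk size here is an arbitrary assumption.

```python
def partial_rollouts(trajectory, chunk=4):
    """Yield growing prefixes of a trajectory so the trainer can consume
    them early, instead of blocking until the trajectory completes."""
    for end in range(chunk, len(trajectory), chunk):
        yield trajectory[:end], False   # partial prefix, trajectory still running
    yield trajectory, True              # the finished trajectory

steps = list(range(10))                 # a 10-step agentic trajectory
prefixes = list(partial_rollouts(steps))
print([(len(p), done) for p, done in prefixes])  # → [(4, False), (8, False), (10, True)]
```

With this shape, a trajectory that takes hours contributes training signal at every chunk boundary rather than only at the end.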

Mixed Precision

Rollout engines use FP8 for generation (fast, cheap) while training uses BF16 (accurate). This decoupling further improves throughput.

Scale

Post-training runs generate 3,000-6,000 messages each (~60-100M output tokens), honing long-range planning and tool-use capabilities.

Results: What Slime Buys You

  • 56% reduction in hallucination rate compared to GLM-4.5
  • Trained abstention: the model recognizes the limits of its knowledge and declines to answer rather than fabricate
  • Fixes the "token-saving" regression, a common RL flaw where models rush to conclusions to minimize output length. Slime's async approach lets the model learn from long, complex interactions without compute stalls

Benchmark Performance

Humanity's Last Exam (with tools): 50.4%
SWE-Bench Verified: 77.8%
GPQA-Diamond: 86.0%
AIME 2026: 92.7%
Terminal-Bench 2.0: 60.7%
Vending Bench 2: $4,432
AI Intelligence Index v4.0: first open-source model above 50

Hardware Independence

GLM-5 was trained on a 100,000-chip Huawei Ascend 910B cluster using the MindSpore framework. No NVIDIA A100s, H100s, or AMD MI300Xs. This is the first frontier model to achieve full hardware independence from US semiconductor supply chains — a geopolitically significant milestone regardless of one's perspective on the AI race.

The "Pony Alpha" Episode

On February 6, 2026, an anonymous model called "Pony Alpha" appeared on OpenRouter — free, with zero creator details. The AI community scrambled to identify it. When asked "who are you?", it answered: "I am GLM." When asked to write a page about itself, it wrote: "I am Claude, created by Anthropic." This raised questions about training data contamination and identity confusion in open models — a recurring theme as models are trained on increasingly web-scale data.

Pricing

At $1.00/M input tokens and $3.20/M output tokens, GLM-5 is significantly cheaper than GPT-5.2 or Claude Opus 4.6 for API access. The model is also fully downloadable under MIT license for self-hosting.
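At these rates, a back-of-the-envelope cost calculator is one function. The token counts in the usage line are hypothetical, chosen only to put the pricing in context.

```python
def api_cost(input_tokens, output_tokens, in_rate=1.00, out_rate=3.20):
    """GLM-5 API cost in USD at the listed per-million-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Hypothetical workload: 10M input tokens, 60M output tokens.
print(round(api_cost(10_000_000, 60_000_000), 2))   # → 202.0
```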

Why It's in Assess

GLM-5 is the most architecturally interesting open-weight model of 2026. The sparse MoE approach (744B total, 40B active) is the clearest demonstration that you don't need dense computation for frontier performance. The Slime RL framework's async approach to post-training is a genuine innovation worth studying for anyone doing RL research. The hardware independence story adds strategic importance.

However, assess carefully: the Pony Alpha episode suggests training data quality concerns, and real-world coding performance in enterprise contexts may not match benchmark scores. The model is positioned for agentic workflows, but most teams will find it easier to use via API providers than to self-host 744B parameters.

Key Characteristics

Company: Zhipu AI / Z.ai
Model: GLM-5
Architecture: Sparse MoE (744B total / 40B active)
Training hardware: 100,000 Huawei Ascend 910B chips
Post-training: Slime async RL framework (open source)
Key innovations: Sparse MoE efficiency, APRIL partial rollouts, hardware independence
License: MIT
Pricing: $1.00/M input, $3.20/M output
Released: February 11, 2026
Sources: GLM-5 GitHub, Slime Framework, Architecture Deep Dive, Let's Data Science