Gemma 4 (released April 2, 2026) is Google DeepMind's fourth-generation open-weight model family: four multimodal variants from 2.3B to 31B parameters, all under Apache 2.0. The 31B Dense model scores 89.2% on AIME 2026, and the 26B MoE variant comes within a point of it at 88.3% using only 3.8B active parameters.
Why It's in Trial
Gemma 4 earns Trial on strong benchmarks, an agentic-first architecture, and a license upgrade that removes the main commercial friction of prior generations:
- Massive benchmark leap: Gemma 4 31B scores 89.2% on AIME 2026 vs. 20.8% for Gemma 3 27B. Codeforces ELO jumps from 110 to 2150. LiveCodeBench nearly triples. These are not incremental improvements.
- Apache 2.0 license: First time in the Gemma lineage. No custom clauses, no "Harmful Use" carve-outs, no redistribution restrictions. This removes the enterprise legal friction that kept Gemma 3 at arm's length for many teams.
- MoE efficiency: The 26B A4B MoE activates only 3.8B parameters at inference — near-E4B latency with near-31B quality on reasoning benchmarks.
- Agentic focus: Thinking mode produces 4,000+ token step-by-step reasoning chains before committing to output. The hybrid attention architecture (local sliding window interleaved with full global attention) is tuned for long-context agentic tasks.
- Audio on edge models: E2B and E4B support native audio input (speech recognition and understanding), not just text+image.
Gemma 4 stays in Trial rather than Adopt because it was released on April 2, 2026, and adoption data is not yet available. The "3+ named orgs in production" bar takes time to meet. The benchmark improvements are extraordinary, but production evidence has yet to accumulate.
When NOT to Use Gemma 4
- Frontier coding agents: Despite the improvements, Gemma 4 still trails Qwen 3.5, GLM-5, and Kimi K2.5 on some benchmarks. For the highest-stakes agentic coding, evaluate those models alongside it.
- Very long context: Llama 4 Scout offers a 10M-token context window. Gemma 4's 256K cap is sufficient for most workloads but not for loading entire document corpora into a single prompt.
- Fine-tuning pipelines already on Gemma 3: The architecture changes (MoE, hybrid attention) mean fine-tuning workflows don't transfer directly.
Gemma 4 Model Family
| Variant | Effective Params | Active Params | Context | Modalities | Best For |
|---|---|---|---|---|---|
| E2B | 2.3B | 2.3B | 128K | text + image + audio | Mobile/IoT, speech apps |
| E4B | 4.5B | 4.5B | 128K | text + image + audio | Edge servers, mid-tier mobile |
| 26B A4B | 26B total | 3.8B (MoE) | 256K | text + image | Low-latency inference, production APIs |
| 31B Dense | 31B | 31B | 256K | text + image | Quality-first workloads, fine-tuning |
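The table can be read as a small lookup problem: filter by required modalities and context length, then prefer the variant with the fewest active parameters. The sketch below encodes only the values from the table; the `Variant` class and the `candidates` helper are hypothetical names introduced for illustration, and "fewest active parameters first" is an assumed stand-in for cost, not a measured ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    active_params_b: float  # billions of parameters used per token
    context_k: int          # context window, in thousands of tokens
    modalities: frozenset


# Values copied from the model family table.
VARIANTS = [
    Variant("E2B", 2.3, 128, frozenset({"text", "image", "audio"})),
    Variant("E4B", 4.5, 128, frozenset({"text", "image", "audio"})),
    Variant("26B A4B", 3.8, 256, frozenset({"text", "image"})),
    Variant("31B Dense", 31.0, 256, frozenset({"text", "image"})),
]


def candidates(required_modalities, min_context_k):
    """Return variants that meet both constraints, fewest active parameters first."""
    needed = frozenset(required_modalities)
    fits = [
        v for v in VARIANTS
        if needed <= v.modalities and v.context_k >= min_context_k
    ]
    return sorted(fits, key=lambda v: v.active_params_b)


print([v.name for v in candidates({"text", "audio"}, 100)])
print([v.name for v in candidates({"text", "image"}, 200)])
print([v.name for v in candidates({"text", "audio"}, 200)])
```

The last query returns an empty list: per the table, no variant combines audio input with a 256K context window.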
Architecture
Gemma 4 introduces two significant architectural changes from Gemma 3:
- Hybrid attention: Interleaves local sliding window attention with full global attention layers. The final layer is always global. This gives the model a small-model memory footprint with large-model long-context awareness, which is particularly relevant for agentic tasks with long tool-call histories.
- Thinking mode: All four variants support thinking mode, triggered via `enable_thinking=True` in the API or by including the `<|think|>` token in the system prompt. When enabled, models produce extended reasoning chains (4,000+ tokens) before committing to output. On the larger models (26B, 31B), the thinking scaffolding is always present: even when thinking is disabled, they emit empty thinking tags, while E2B and E4B cleanly suppress them. The capability is baked into the base model rather than added by a separate fine-tune. (Google AI docs: thinking mode)
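To make the hybrid attention trade-off concrete, here is a minimal sketch of how interleaved layers and a sliding window reduce the number of cached key/value positions. The layer count, the local-to-global ratio, and the window size are hypothetical values chosen for illustration; the source does not state Gemma 4's actual settings, and the function names are not from any library.

```python
def layer_pattern(num_layers, local_per_global):
    """Label each layer 'local' or 'global'.

    Every (local_per_global + 1)-th layer is global, and the final
    layer is always global, as described above.
    """
    pattern = []
    for i in range(num_layers):
        is_global = (i + 1) % (local_per_global + 1) == 0 or i == num_layers - 1
        pattern.append("global" if is_global else "local")
    return pattern


def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal (global) attention; otherwise only the
    most recent `window` positions, including q itself, are visible.
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def kv_entries(seq_len, pattern, window):
    """Count cached key/value positions summed over layers."""
    return sum(
        seq_len if kind == "global" else min(seq_len, window)
        for kind in pattern
    )


pattern = layer_pattern(8, 3)  # hypothetical: 8 layers, 3 local per global
print(pattern)
print(kv_entries(1000, pattern, 100), "vs", kv_entries(1000, ["global"] * 8, 100))
```

With these toy numbers the hybrid stack caches 2,600 positions against 8,000 for an all-global stack, which is the memory saving the bullet above refers to; the global layers are what keep distant tokens reachable.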
The 26B A4B uses standard MoE routing with 3.8B active parameters out of 26B total — comparable token throughput to E4B but with near-31B quality on reasoning tasks.
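As an illustration of why only a fraction of the parameters is active per token, the following sketch implements generic top-k expert routing on toy scalar "experts". It is not Gemma 4's router: the number of experts, the value of `k`, and the function names are assumptions made for the example.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts.

    Returns (expert_index, weight) pairs whose weights are a softmax
    over the selected logits, so they sum to 1.
    """
    top = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs.

    Experts that are not selected are never evaluated, which is where
    the compute saving comes from.
    """
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy experts; in a real model each would be a feed-forward block.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

print(top_k_route(logits, 2))
print(moe_forward(3.0, experts, logits, 2))
print(f"active fraction for 26B A4B: {3.8 / 26:.1%}")
```

Using the figures quoted above, 3.8B of 26B parameters is roughly 15% of the weights doing work for any given token, even though all experts must remain loaded because the router can pick a different subset for each token.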
Benchmarks
| Benchmark | E2B | E4B | 26B A4B | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|---|
| AIME 2026 | 37.5% | 42.5% | 88.3% | 89.2% | 20.8% |
| LiveCodeBench | 44.0% | 52.0% | — | 80% | — |
| MMLU Pro | — | — | — | 85.2% | — |
| GPQA | — | — | — | 84% | — |
| Codeforces ELO | — | — | — | 2150 | 110 |
Sources: Google DeepMind blog, ai.rs comparison. AIME/Codeforces figures reflect thinking mode.
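For readers who want the headline comparisons as numbers, this short calculation derives them from the table. It uses only the AIME 2026 scores and active parameter counts quoted in this entry; the variable names are illustrative.

```python
# Scores copied from the benchmark table (thinking mode).
aime = {"31B Dense": 89.2, "26B A4B": 88.3, "Gemma 3 27B": 20.8}
active_params_b = {"31B Dense": 31.0, "26B A4B": 3.8}

aime_gain = aime["31B Dense"] / aime["Gemma 3 27B"]
moe_gap_points = aime["31B Dense"] - aime["26B A4B"]
active_fraction = active_params_b["26B A4B"] / active_params_b["31B Dense"]

print(f"AIME 2026, 31B Dense vs Gemma 3 27B: {aime_gain:.2f}x")
print(f"AIME 2026 gap, 31B Dense minus 26B A4B: {moe_gap_points:.1f} points")
print(f"26B A4B active parameters relative to 31B Dense: {active_fraction:.1%}")
```

On these figures the MoE variant gives up under one point on AIME 2026 while activating about an eighth of the parameters the dense model uses per token.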
License & Deployment
- Apache 2.0 — commercially permissive, no custom carve-outs (departure from prior Gemma License)
- Weights on Hugging Face: google/gemma-4-*
- NVIDIA-optimized builds for RTX local inference (via NVIDIA TensorRT-LLM)
- ARM-optimized builds for on-device mobile deployment
- Unsloth support for local quantized inference
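When sizing hardware for local or quantized inference, a first-order estimate of weight storage is parameters times bits per weight. The sketch below applies that arithmetic to the parameter counts in the model family table. It is a rough lower bound under stated assumptions: it uses the effective counts for E2B and E4B (actual checkpoint sizes may differ), and it ignores the KV cache, activations, and quantization metadata. The `weight_gb` helper is a hypothetical name.

```python
# Parameter counts in billions, from the model family table.
# The MoE variant uses its total (26B), not its active count (3.8B):
# every expert has to be resident because routing varies per token.
PARAMS_B = {"E2B": 2.3, "E4B": 4.5, "26B A4B": 26.0, "31B Dense": 31.0}


def weight_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


for name, params in PARAMS_B.items():
    sizes = ", ".join(
        f"{bits}-bit: {weight_gb(params, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{name:>9}  {sizes}")
```

The MoE variant saves compute per token, not memory: at any given precision its weights occupy almost as much space as the 31B Dense model's.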
Key Characteristics
| Property | Value |
|---|---|
| Parameter range | 2.3B–31B (effective); 3.8B active for 26B MoE |
| Latest generation | Gemma 4 (April 2, 2026) |
| Architecture | Dense (E2B, E4B, 31B) + MoE (26B A4B) |
| Attention | Hybrid: local sliding window + global |
| Context window | 128K (edge), 256K (larger) |
| Modalities | Text + image (all); + audio (E2B, E4B) |
| License | Apache 2.0 |
| Provider | Google DeepMind |
Further Reading
- Gemma 4: Byte for byte, the most capable open models (Google DeepMind, April 2026)
- Welcome Gemma 4: Frontier multimodal intelligence on device (Hugging Face blog)
- Google releases Gemma 4 under Apache 2.0 (VentureBeat)
- Gemma 4 on Arm: Optimized on-device AI (Arm Newsroom)
- Gemma 4 on Hugging Face: google/gemma-4-26B-A4B