Technology RadarTechnology Radar
Trial

Gemma 4 (released April 2, 2026) is Google DeepMind's fourth-generation open-weight model family — four variants from 2.3B to 31B parameters, all multimodal, all under Apache 2.0, with a 31B Dense model that scores 89.2% on AIME 2026 and a 26B MoE variant that matches it at 88.3% using only 3.8B active parameters.

Why It's in Trial

Gemma 4 earns Trial on strong benchmarks, an agentic-first architecture, and a license upgrade that removes the main commercial friction of prior generations:

  • Massive benchmark leap: Gemma 4 31B scores 89.2% on AIME 2026 vs. 20.8% for Gemma 3 27B. Codeforces ELO jumps from 110 to 2150. LiveCodeBench nearly triples. These are not incremental improvements.
  • Apache 2.0 license: First time in the Gemma lineage. No custom clauses, no "Harmful Use" carve-outs, no redistribution restrictions. This removes the enterprise legal friction that kept Gemma 3 at arm's length for many teams.
  • MoE efficiency: The 26B A4B MoE activates only 3.8B parameters at inference — near-E4B latency with near-31B quality on reasoning benchmarks.
  • Agentic focus: Thinking mode produces 4,000+ token step-by-step reasoning chains before committing to output. The hybrid attention architecture (local sliding window interleaved with full global attention) is tuned for long-context agentic tasks.
  • Audio on edge models: E2B and E4B support native audio input (speech recognition and understanding), not just text+image.

Stays in Trial rather than Adopt because: released April 2, 2026 — adoption data is not yet available. The "3+ named orgs in production" bar requires time. Benchmark improvements are extraordinary; production evidence will follow.

When NOT to Use Gemma 4

  • Frontier coding agents: Gemma 4 trails Qwen 3.5, GLM-5, and Kimi K2.5 on some benchmarks despite the improvements. For highest-stakes agentic coding, these remain competitive alternatives.
  • Llama 4 Scout for very long context: Llama 4 Scout offers a 10M token context window; Gemma 4's 256K cap is sufficient for most workloads but not exhaustive document corpora.
  • Fine-tuning pipelines already on Gemma 3: The architecture changes (MoE, hybrid attention) mean fine-tuning workflows don't transfer directly.

Gemma 4 Model Family

Variant Effective Params Active Params Context Modalities Best For
E2B 2.3B 2.3B 128K text + image + audio Mobile/IoT, speech apps
E4B 4.5B 4.5B 128K text + image + audio Edge servers, mid-tier mobile
26B A4B 26B total 3.8B (MoE) 256K text + image Low-latency inference, production APIs
31B Dense 31B 31B 256K text + image Quality-first workloads, fine-tuning

Architecture

Gemma 4 introduces two significant architectural changes from Gemma 3:

  • Hybrid attention: Interleaves local sliding window attention with full global attention layers. The final layer is always global. This delivers small-model memory footprint with large-model long-context awareness — particularly relevant for agentic tasks with long tool-call histories.
  • Thinking mode: All four variants support thinking mode, triggered via enable_thinking=True in the API or by including the <|think|> token in the system prompt. When enabled, models produce extended reasoning chains (4,000+ tokens) before committing to output. On the larger models (26B, 31B), the thinking scaffolding is always present — even when disabled, they emit empty thinking tags — while E2B and E4B cleanly suppress them. The capability is baked into the base model rather than a separate fine-tune. (Google AI docs: thinking mode)

The 26B A4B uses standard MoE routing with 3.8B active parameters out of 26B total — comparable token throughput to E4B but with near-31B quality on reasoning tasks.

Benchmarks

Benchmark E2B E4B 26B A4B 31B Dense Gemma 3 27B
AIME 2026 37.5% 42.5% 88.3% 89.2% 20.8%
LiveCodeBench 44.0% 52.0% 80%
MMLU Pro 85.2%
GPQA 84%
Codeforces ELO 2150 110

Sources: Google DeepMind blog, ai.rs comparison. AIME/Codeforces figures reflect thinking mode.

License & Deployment

  • Apache 2.0 — commercially permissive, no custom carve-outs (departure from prior Gemma License)
  • Weights on Hugging Face: google/gemma-4-*
  • NVIDIA-optimized builds for RTX local inference (via NVIDIA TensorRT-LLM)
  • ARM-optimized builds for on-device mobile deployment
  • Unsloth support for local quantized inference

Key Characteristics

Property Value
Parameter range 2.3B–31B (effective); 3.8B active for 26B MoE
Latest generation Gemma 4 (April 2, 2026)
Architecture Dense (E2B, E4B, 31B) + MoE (26B A4B)
Attention Hybrid: local sliding window + global
Context window 128K (edge), 256K (larger)
Modalities Text + image (all); + audio (E2B, E4B)
License Apache 2.0
Provider Google DeepMind

Further Reading