Google Gemma

llm open-source self-hosted multimodal reasoning agentic

Jun 2026

Trial

Gemma 4 (released April 2, 2026) is Google DeepMind's fourth-generation open-weight model family. It ships four variants from 2.3B to 31B parameters, all multimodal and all under Apache 2.0. The 31B Dense model scores 89.2% on AIME 2026, and the 26B MoE variant comes within a point at 88.3% while activating only 3.8B parameters per token.

Why It's in Trial

Gemma 4 earns Trial on strong benchmarks, an agentic-first architecture, and a license upgrade that removes the main commercial friction of prior generations:

Massive benchmark leap: Gemma 4 31B scores 89.2% on AIME 2026 vs. 20.8% for Gemma 3 27B, more than a fourfold gain. Its Codeforces Elo jumps from 110 to 2150, and LiveCodeBench nearly triples. These are not incremental improvements.
Apache 2.0 license: First time in the Gemma lineage. No custom clauses, no "Harmful Use" carve-outs, no redistribution restrictions. This removes the enterprise legal friction that kept Gemma 3 at arm's length for many teams.
MoE efficiency: The 26B A4B MoE activates only 3.8B of its 26B parameters per token at inference, giving near-E4B latency with near-31B quality on reasoning benchmarks (88.3% vs. 89.2% on AIME 2026).
Agentic focus: Thinking mode produces 4,000+ token step-by-step reasoning chains before committing to output. The hybrid attention architecture (local sliding window interleaved with full global attention) is tuned for long-context agentic tasks.
Audio on edge models: E2B and E4B support native audio input (speech recognition and understanding), not just text+image.

Gemma 4 stays in Trial rather than Adopt because it was released on April 2, 2026, and adoption data is not yet available. The "3+ named orgs in production" bar takes time to meet. The benchmark improvements are extraordinary, but production evidence has not yet caught up.

When NOT to Use Gemma 4

Frontier coding agents: Despite the improvements, Gemma 4 still trails Qwen 3.5, GLM-5, and Kimi K2.5 on some benchmarks. For the highest-stakes agentic coding, those three models remain competitive alternatives.
Very long context: Llama 4 Scout offers a 10M-token context window. Gemma 4's 256K cap (128K on E2B and E4B) is sufficient for most workloads but not for exhaustive document corpora.
Fine-tuning pipelines already on Gemma 3: The architecture changes (MoE, hybrid attention) mean fine-tuning workflows don't transfer directly.

Gemma 4 Model Family

Variant	Effective Params	Active Params	Context	Modalities	Best For
E2B	2.3B	2.3B	128K	text + image + audio	Mobile/IoT, speech apps
E4B	4.5B	4.5B	128K	text + image + audio	Edge servers, mid-tier mobile
26B A4B	26B total	3.8B (MoE)	256K	text + image	Low-latency inference, production APIs
31B Dense	31B	31B	256K	text + image	Quality-first workloads, fine-tuning
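
As a rough sizing aid for the table above, the sketch below estimates weight-only memory per variant at a few precisions. The byte widths (2 bytes for bf16, 1 for int8, 0.5 for 4-bit) are generic arithmetic rather than figures from the Gemma 4 release, and the estimate ignores KV cache, activations, and runtime overhead, so it is best read as a lower bound. It also assumes that all experts of the 26B A4B stay resident in memory, which is typical for MoE serving: the 3.8B active count reduces compute per token, not weight storage. The names below are hypothetical.

```python
# Parameter counts (billions) copied from the model family table above.
PARAMS_B = {"E2B": 2.3, "E4B": 4.5, "26B A4B": 26.0, "31B Dense": 31.0}

# Bytes per parameter at common precisions (an assumption, not release data).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and framework overhead.
    """
    return params_billion * BYTES_PER_PARAM[precision]


def sizing_table() -> dict[str, dict[str, float]]:
    """Estimated weight memory for every variant at every listed precision."""
    return {
        name: {p: round(weight_memory_gb(n, p), 2) for p in BYTES_PER_PARAM}
        for name, n in PARAMS_B.items()
    }


if __name__ == "__main__":
    for name, row in sizing_table().items():
        print(f"{name:10s}", row)
```

Under these assumptions the 26B A4B needs roughly as much weight memory as a dense model of the same total size, so its advantage shows up in latency and throughput rather than in hardware footprint.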

Architecture

Gemma 4 introduces two significant architectural changes from Gemma 3:

Hybrid attention: Interleaves local sliding-window attention layers with full global attention layers; the final layer is always global. Local layers attend only to a fixed window of recent tokens, which keeps the memory footprint closer to that of a small model, while the global layers preserve long-context awareness. This is particularly relevant for agentic tasks with long tool-call histories.
Thinking mode: All four variants support thinking mode, enabled via enable_thinking=True in the API or by including the <|think|> token in the system prompt. When enabled, models produce extended reasoning chains (4,000+ tokens) before committing to output. On the larger models (26B, 31B) the thinking scaffolding is always present: even with thinking disabled they emit empty thinking tags, whereas E2B and E4B suppress them cleanly. The capability is built into the base model rather than added by a separate fine-tune. (Google AI docs: thinking mode)
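
The hybrid attention layout described above can be illustrated with a small sketch. The interleaving ratio (one global layer after every few local layers), the layer count, and the window size are illustrative assumptions; the source states only that local and global layers interleave and that the final layer is global. All function names are hypothetical.

```python
def layer_pattern(num_layers: int, locals_per_global: int = 5) -> list[str]:
    """Return 'local' or 'global' for each layer; the last layer is forced global."""
    period = locals_per_global + 1
    pattern = [
        "global" if (i + 1) % period == 0 else "local" for i in range(num_layers)
    ]
    pattern[-1] = "global"
    return pattern


def attention_mask(seq_len: int, kind: str, window: int = 4) -> list[list[bool]]:
    """Causal mask: mask[q][k] is True when query position q may attend to key k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = q - k < window  # only the most recent `window` tokens
            row.append(causal and (kind == "global" or in_window))
        mask.append(row)
    return mask


def cached_positions(seq_len: int, pattern: list[str], window: int) -> int:
    """Key/value positions kept across all layers after seq_len tokens."""
    return sum(
        seq_len if kind == "global" else min(seq_len, window) for kind in pattern
    )


if __name__ == "__main__":
    pattern = layer_pattern(8, locals_per_global=3)
    print(pattern)
    print(cached_positions(1000, pattern, window=100), "vs", 1000 * len(pattern))
```

In this toy setting, six local layers with a 100-token window and two global layers keep 2,600 cached positions for a 1,000-token sequence, compared with 8,000 if every layer were global. The real savings depend on Gemma 4's actual ratio and window, which are not given here.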

The 26B A4B uses standard MoE routing, activating 3.8B of its 26B total parameters (about 15%) per token. That gives token throughput comparable to E4B with near-31B quality on reasoning tasks.
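
A minimal sketch of top-k expert routing, the general technique behind MoE layers, shows why only a fraction of the parameters run for each token. The expert count, the value of k, and the toy scalar experts are illustrative assumptions, not Gemma 4's actual configuration, which the source does not specify.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x: float, router_logits: list[float], experts, k: int = 2) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy experts (hypothetical): each scales its input by a different constant.
EXPERTS = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Active fraction implied by the figures quoted above: 3.8B of 26B parameters.
ACTIVE_FRACTION = 3.8 / 26.0

if __name__ == "__main__":
    print(route_top_k([0.0, 5.0, 1.0, 5.0], k=2))
    print(moe_forward(1.0, [0.0, 5.0, 1.0, 5.0], EXPERTS, k=2))
    print(f"active fraction: {ACTIVE_FRACTION:.1%}")
```

Experts that the router does not select are never evaluated for that token, which is where the compute saving comes from.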

Benchmarks

Benchmark	E2B	E4B	26B A4B	31B Dense	Gemma 3 27B
AIME 2026	37.5%	42.5%	88.3%	89.2%	20.8%
LiveCodeBench	44.0%	52.0%	—	80%	—
MMLU Pro	—	—	—	85.2%	—
GPQA	—	—	—	84%	—
Codeforces Elo	—	—	—	2150	110

Sources: Google DeepMind blog, ai.rs comparison. AIME/Codeforces figures reflect thinking mode.
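
To put the generational comparison in concrete terms, the short calculation below derives the change for the two rows where the table reports a Gemma 3 27B figure. All inputs are copied from the table; nothing here is new data, and the helper names are hypothetical.

```python
# (Gemma 4 31B Dense, Gemma 3 27B) pairs copied from the benchmark table above.
AIME_2026 = (89.2, 20.8)   # percent
CODEFORCES = (2150, 110)   # rating points


def point_gain(new: float, old: float) -> float:
    """Absolute improvement, rounded to one decimal place."""
    return round(new - old, 1)


def ratio(new: float, old: float) -> float:
    """Multiplicative improvement, rounded to one decimal place."""
    return round(new / old, 1)


if __name__ == "__main__":
    print("AIME 2026 gain (points):", point_gain(*AIME_2026))
    print("AIME 2026 ratio:", ratio(*AIME_2026))
    print("Codeforces gain (rating points):", point_gain(*CODEFORCES))
```

The other rows lack a Gemma 3 27B entry in the table, so no comparable figure can be derived for them from this page alone.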

License & Deployment

Apache 2.0 — commercially permissive, no custom carve-outs (departure from prior Gemma License)
Weights on Hugging Face: google/gemma-4-*
NVIDIA-optimized builds for RTX local inference (via NVIDIA TensorRT-LLM)
ARM-optimized builds for on-device mobile deployment
Unsloth support for local quantized inference

Key Characteristics

Property	Value
Parameter range	2.3B–31B (effective); 3.8B active for 26B MoE
Latest generation	Gemma 4 (April 2, 2026)
Architecture	Dense (E2B, E4B, 31B) + MoE (26B A4B)
Attention	Hybrid: local sliding window + global
Context window	128K (edge), 256K (larger)
Modalities	Text + image (all); + audio (E2B, E4B)
License	Apache 2.0
Provider	Google DeepMind