NVIDIA Nemotron

Jun 2026

Trial

NVIDIA's Nemotron model family -- spanning Llama Nemotron (8B to 253B) and the hybrid Mamba-Transformer Nemotron 3 series (up to 1M context) -- powers the local inference tier in NVIDIA's enterprise AI stack, most notably as the on-device model behind NemoClaw's privacy router.

Why It Moved to Trial

Nemotron 3 Ultra's June 4, 2026 weight release changes the calculus. The prior Assess reasons — "announced but weights not yet available" and "not competitive on frontier benchmarks" — no longer hold:

Ultra weights are now public: Released June 4 via HuggingFace, NVIDIA NIM, vLLM, SGLang, and TensorRT-LLM. BF16 checkpoint and NVFP4 variant both public.
Frontier-tier benchmark scores, now independently corroborated: NVIDIA self-reports SWE-bench Verified 71.9%, GPQA (no tools) 87.0%, LiveCodeBench v6 89.0%, RULER 1M 94.7%. Artificial Analysis — an independent benchmark tracker — measured SWE-bench Verified scores of 65–70.4% across five different agent harnesses (Pi, OpenHands, Hermes, OpenCode, Mini SWE Agent), close enough to NVIDIA's number to validate the headline claim. Their Intelligence Index places it at 48 (9th of 89 models tracked), the highest of any US open-weight model, and they independently clocked throughput above 300 tokens/sec on DeepInfra — versus 50–100 tok/s for comparably sized DeepSeek and Moonshot models.
1M token context: Confirmed in the model card. Nemotron 3 Ultra fills the same long-context niche as Ultra's predecessor Super, but at frontier scale.
Open weights with full recipe: NVIDIA releases training data and the end-to-end recipe alongside the weights — a differentiator over closed-weight models.
Hardware accessible via NVFP4: The NVFP4 variant runs on 8× H100 (vs. 8× B200/GB200 for BF16). Still high, but within reach of mid-scale enterprise GPU fleets.

It stays at Trial rather than Adopt because:

No independent production evidence yet — independent benchmarks now corroborate NVIDIA's numbers, but no real-world deployment reports have surfaced in the days since release
Hardware floor is still extreme even with NVFP4 (8× H100 minimum)
Primary value remains within the NVIDIA ecosystem (NemoClaw, NIM, TensorRT-LLM)

The Llama Nemotron Family (March 2025)

Fine-tuned from Meta's Llama models with NVIDIA's post-training for reasoning, tool calling, and RAG:

Model	Parameters	Base	Use Case
Nemotron Nano	8B	Llama 3.1 8B	PC and edge inference
Nemotron Super	49B	Llama 3.3 70B (distilled)	Single data center GPU
Nemotron Ultra	253B	Llama 3.1 405B (distilled)	Multi-GPU data center

Nemotron Ultra supports 128K context, processes at ~42 tokens/sec, and costs $0.60/M input tokens via NIM. Post-training boosts accuracy by up to 20% over the base Llama models and optimizes inference speed by 5x compared to other open reasoning models.

The Nemotron 3 Family (2025-2026)

NVIDIA's own architecture -- hybrid Mamba-Transformer MoE designed for high-throughput agentic workloads:

Model	Total Params	Active Params	Architecture	Context	Status
Nemotron 3 Nano	31.6B (MoE)	3.2B	Mamba2-Transformer Hybrid	262K	Available
Nemotron 3 Nano Omni	30B-A3B (MoE)	~3B	Mamba-Transformer Hybrid	TBD	Available (April 28, 2026)
Nemotron 3 Super	120B (MoE)	12B	Hybrid Mamba-Transformer MoE	1M	Available (March 2026)
Nemotron 3 Ultra	550B (MoE)	55B	LatentMoE — Mamba-2 + MoE + Attention + MTP	1M	Released June 4, 2026 (OpenMDW-1.1)

Nemotron 3 Nano delivers 4x higher throughput than Nemotron 2 Nano. Nano Omni (released April 28, 2026) extends Nano with native vision, audio, and language in a single model — up to 9× higher throughput than comparable open multimodal models — targeting agentic pipelines that need lightweight perception without separate model calls. Super (released March 2026) offers up to 7x faster, cost-efficient inference and is the primary model for high-throughput agentic workloads. Ultra (released June 4, 2026) is a frontier-scale open-weights model: 550B total parameters / 55B active, LatentMoE architecture (Mamba-2 + MoE + Attention hybrid + Multi-Token Prediction), 1M token context, OpenMDW-1.1 license (open weights, training data, recipe). NVIDIA's self-reported benchmarks: SWE-bench Verified 71.9%, GPQA 87.0%, LiveCodeBench v6 89.0%, RULER (1M) 94.7%, TauBench v3 avg 70.9%. Minimum hardware: 8× H100 (NVFP4) or 8× B200/H200 (BF16). Available via HuggingFace, NVIDIA NIM, vLLM, SGLang, and TensorRT-LLM.

Where It Fits

Nemotron is not a replacement for Claude, GPT, or Grok for general coding tasks. Its sweet spot is:

Privacy routing in NemoClaw -- classifying whether data can leave the device
Air-gapped enterprise environments requiring fully on-premise inference
NVIDIA-stack deployments already using NIM, TensorRT-LLM, or NeMo
Edge/on-device inference where latency and data locality matter

Key Characteristics

Property	Value
Model families	Llama Nemotron (8B–253B), Nemotron 3 (Nano/Super/Ultra)
Architecture	Fine-tuned Llama (Nemotron); LatentMoE Mamba-2+MoE+Attention hybrid (Nemotron 3)
Max context window	1M tokens (Nemotron 3 Super and Ultra)
License	Llama Nemotron: NVIDIA AI Foundation License; Nemotron 3 Ultra: OpenMDW-1.1
Provider	NVIDIA
Deployment	vLLM, SGLang, TensorRT-LLM, NVIDIA NIM, HuggingFace