NVIDIA's Nemotron model family -- spanning Llama Nemotron (8B to 253B) and the hybrid Mamba-Transformer Nemotron 3 series (up to 1M context) -- powers the local inference tier in NVIDIA's enterprise AI stack, most notably as the on-device model behind NemoClaw's privacy router.
## Why It's in Assess
Nemotron occupies a specific and important niche in the enterprise AI landscape:
- Two complementary model families: Llama Nemotron (reasoning-optimized fine-tunes of Meta's Llama) and Nemotron 3 (NVIDIA's own hybrid Mamba-Transformer MoE architecture with up to 1M token context)
- NemoClaw integration: Nemotron on-device models power the privacy router in NemoClaw, NVIDIA's enterprise agent platform -- sensitive data stays local while non-sensitive requests route to cloud frontier models (see the OpenClaw & NemoClaw deep dive entry)
- Open weights with recipes: NVIDIA releases not just weights but training data and recipes, enabling enterprise customization
- Hardware-optimized: TensorRT-LLM, NVIDIA NIM, and native vLLM/SGLang/Ollama support
It sits in Assess rather than Trial because:
- Not competitive on frontier coding benchmarks -- these are infrastructure models, not coding models
- Nemotron 3 Super (120B) and Ultra are not yet available (expected H1 2026)
- Primary value is within the NVIDIA ecosystem (NemoClaw, NIM, TensorRT-LLM)
## The Llama Nemotron Family (March 2025)
Fine-tuned from Meta's Llama models with NVIDIA's post-training for reasoning, tool calling, and RAG:
| Model | Parameters | Base | Use Case |
|---|---|---|---|
| Nemotron Nano | 8B | Llama 3.1 8B | PC and edge inference |
| Nemotron Super | 49B | Llama 3.3 70B (distilled) | Single data center GPU |
| Nemotron Ultra | 253B | Llama 3.1 405B (distilled) | Multi-GPU data center |
Nemotron Ultra supports 128K context, generates at ~42 tokens/sec, and costs $0.60/M input tokens via NIM. NVIDIA reports that post-training boosts accuracy by up to 20% over the base Llama models and yields up to 5x faster inference than other open reasoning models.
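The quoted figures translate into simple back-of-envelope planning numbers. This is a rough sketch using only the input-token price and decode speed stated above; output-token pricing is not quoted here, so it is deliberately omitted.

```python
# Back-of-envelope cost/latency estimates for Nemotron Ultra via NIM,
# using the figures above: $0.60 per million input tokens and ~42 tok/s
# decode speed. Input-side cost only -- not a billing calculator.

NIM_INPUT_PRICE_PER_M = 0.60   # USD per million input tokens
DECODE_TOKENS_PER_SEC = 42     # approximate generation speed

def input_cost_usd(input_tokens: int) -> float:
    """Input-side cost of a single request, in USD."""
    return input_tokens / 1_000_000 * NIM_INPUT_PRICE_PER_M

def decode_seconds(output_tokens: int) -> float:
    """Approximate wall-clock time to generate output_tokens."""
    return output_tokens / DECODE_TOKENS_PER_SEC

# A 100K-token RAG prompt costs about six cents on the input side,
# and a 1,000-token answer takes roughly 24 seconds to generate:
print(f"${input_cost_usd(100_000):.3f}")  # → $0.060
print(f"{decode_seconds(1_000):.0f} s")   # → 24 s
```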
## The Nemotron 3 Family (2025-2026)
NVIDIA's own architecture -- hybrid Mamba-Transformer MoE designed for high-throughput agentic workloads:
| Model | Parameters | Architecture | Context |
|---|---|---|---|
| Nemotron 3 Nano | 4B | Hybrid Mamba2-Transformer | 262K |
| Nemotron 3 Super | 120B MoE (12B active) | Hybrid Mamba-Transformer MoE | 1M |
| Nemotron 3 Ultra | TBD | Hybrid Mamba-Transformer MoE | TBD |
Nemotron 3 Nano delivers 4x higher throughput than Nemotron 2 Nano, and Super is projected to offer up to 7x faster, more cost-efficient inference. Both Super and Ultra are expected in H1 2026.
## Where It Fits
Nemotron is not a replacement for Claude, GPT, or Grok for general coding tasks. Its sweet spot is:
- Privacy routing in NemoClaw -- classifying whether data can leave the device
- Air-gapped enterprise environments requiring fully on-premise inference
- NVIDIA-stack deployments already using NIM, TensorRT-LLM, or NeMo
- Edge/on-device inference where latency and data locality matter
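NemoClaw's router implementation is not public, but the pattern behind the first bullet -- a small on-device model classifying each request before anything leaves the machine -- can be sketched as follows. The regex heuristic here is a stand-in for the local Nemotron classifier, and all names are illustrative assumptions, not NemoClaw APIs.

```python
import re
from dataclasses import dataclass

# Illustrative sketch of the privacy-routing pattern described above.
# In the real system an on-device Nemotron model would make this call;
# a regex heuristic stands in so the example is self-contained.

SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # US SSN-like number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),         # email address
    re.compile(r"\b(?:internal|confidential)\b", re.I),
]

@dataclass
class RoutingDecision:
    target: str   # "local" or "cloud"
    reason: str

def contains_sensitive_data(prompt: str) -> bool:
    """Stand-in for the on-device classifier model."""
    return any(p.search(prompt) for p in SENSITIVE_PATTERNS)

def route(prompt: str) -> RoutingDecision:
    """Sensitive prompts stay on-device; the rest may go to a frontier model."""
    if contains_sensitive_data(prompt):
        return RoutingDecision("local", "sensitive content detected")
    return RoutingDecision("cloud", "no sensitive content detected")

print(route("Summarize this CONFIDENTIAL merger memo").target)  # → local
print(route("Explain quicksort").target)                        # → cloud
```

The design choice worth noting: the classifier runs before any network call, so a false positive merely costs cloud-level quality, while a false negative leaks data -- which is why production routers bias the local model toward over-flagging.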
## Key Characteristics
| Property | Value |
|---|---|
| Model families | Llama Nemotron (8B-253B), Nemotron 3 (4B-120B+) |
| Architecture | Fine-tuned Llama (Nemotron); Hybrid Mamba-Transformer MoE (Nemotron 3) |
| Max context window | 1M tokens (Nemotron 3 Super) |
| License | Open weights (NVIDIA AI Foundation License) |
| Provider | NVIDIA |
| Deployment | vLLM, SGLang, Ollama, llama.cpp, NVIDIA NIM |