NVIDIA releases Nemotron 3 Ultra 550B: open reasoning model with 5x throughput advantage, targeting long-running agents
NVIDIA on June 4, 2026, launched Nemotron 3 Ultra, an open-weights 550-billion-parameter reasoning model designed for enterprise AI agents. The model delivers claimed "5x higher throughput compared to other open models in its class" while matching or beating comparable closed and open models on coding, planning, and long-context benchmarks.
What's new
Nemotron 3 Ultra (model ID: NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4) is NVIDIA's largest and most capable open model release to date. Key specifications:
- 550B total / 55B active parameters via a Mixture-of-Experts (MoE) architecture
- Hybrid Mamba-Transformer design with LatentMoE expert routing — a departure from pure Transformer stacks
- NVFP4 quantization for up to 5x throughput improvement on NVIDIA Blackwell GPUs
- 1M token context window
- Multi-Teacher On-Policy Distillation using more than 10 specialized teacher models
- License: OpenMDW-1.1 (permissive Linux Foundation license covering architecture, weights, and documentation)
Benchmark highlights:
- SWE-bench Verified: 65–70.4% (framework-independent consistency)
- Agent Productivity (PinchBench): 91% (tied for top)
- Instruction Following (IFBench): 82%
- Long Context (Ruler @ 1M tokens): 95%
- Long-horizon Planning (EnterpriseOps-Gym): 33%
- Up to 30% cost savings on task completion vs. comparably-sized alternatives
Availability:
- Weights on Hugging Face
- build.nvidia.com and NVIDIA NIM microservice
- Perplexity (Pro), OpenRouter, Anaconda
- Cloud: AWS, Google Cloud, Azure, Oracle
- Inference: SGLang, TRT-LLM, vLLM
The model competes directly with GLM-5.1 (744B), Kimi K2.6 (1T), and Qwen 3.5 (397B), delivering superior inference speed with comparable reasoning accuracy per NVIDIA's benchmarks.
Context
NVIDIA first previewed the Nemotron 3 family in December 2025, with Ultra listed as "expected H1 2026." The June 4 launch delivers on that timeline. The Nemotron 3 Nano (30B, 3B active) and Super (~100B, 10B active) are the smaller family members; Nano has been available since December while Super and Ultra ship now.
The Hybrid Mamba-Transformer design is notable: Mamba is a state-space model architecture that handles long sequences more efficiently than standard attention, and integrating it with Transformer layers is an active research direction. Coupling this with MoE for compute efficiency and NVFP4 for Blackwell hardware optimization reflects NVIDIA's intent to tie its open AI model strategy directly to its newest GPU generation.
Why it matters
Nemotron 3 Ultra is the most capable open-weights model NVIDIA has released, and its 5x throughput claim over comparable open models — if validated by independent benchmarks — is a meaningful differentiator for enterprises running inference at scale on Blackwell hardware.
The OpenMDW-1.1 license is commercially permissive, making the model viable for production deployments without royalty concerns. Availability across AWS, GCP, Azure, and Oracle from day one covers the major enterprise cloud footprints.
For the open-source ecosystem, a 550B-parameter model with a 1M-token context window and strong SWE-bench scores competes squarely with closed frontier models. The question is whether the active 55B-parameter compute profile allows deployment at costs that make it viable for most teams — NVIDIA's cost savings claim implies it does.
Corroborating sources
- Developer.nvidia
https://developer.nvidia.com/blog/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running-agents/
“5x higher throughput compared to other open models in its class”