NVIDIA releases Nemotron 3 Ultra 550B: open reasoning model with 5x throughput advantage, targeting long-running agents

Jun 4, 2026 (published Jun 13, 2026)

NVIDIA launched Nemotron 3 Ultra on June 4, 2026: an open-weights, 550-billion-parameter reasoning model designed for enterprise AI agents. NVIDIA claims "5x higher throughput compared to other open models in its class" and says the model matches or beats comparable closed and open models on coding, planning, and long-context benchmarks.

What's new

Nemotron 3 Ultra (model ID: NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4) is NVIDIA's largest and most capable open model release to date. Key specifications:

550B total / 55B active parameters via a Mixture-of-Experts (MoE) architecture
Hybrid Mamba-Transformer design with LatentMoE expert routing — a departure from pure Transformer stacks
NVFP4 quantization for up to 5x throughput improvement on NVIDIA Blackwell GPUs
1M token context window
Multi-Teacher On-Policy Distillation using more than 10 specialized teacher models
License: OpenMDW-1.1 (permissive Linux Foundation license covering architecture, weights, and documentation)
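
To put the parameter counts in perspective, the sketch below works through the arithmetic implied by the figures above. It is a back-of-envelope estimate only: it assumes exactly 4 bits per weight for NVFP4 and ignores scaling-factor overhead, any layers kept in higher precision, and KV-cache or activation memory. The helper name `weight_gigabytes` is illustrative, not part of any NVIDIA tooling.

```python
# Back-of-envelope arithmetic from the published parameter counts.
# Assumption: every weight is stored at the stated bit width, with no
# overhead for quantization scales or higher-precision layers.
TOTAL_PARAMS = 550e9   # total parameters (from the spec list)
ACTIVE_PARAMS = 55e9   # parameters active per token (from the spec list)


def weight_gigabytes(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
fp4_gb = weight_gigabytes(TOTAL_PARAMS, 4)    # 4-bit weights
bf16_gb = weight_gigabytes(TOTAL_PARAMS, 16)  # 16-bit weights, for comparison

print(f"active fraction per token: {active_fraction:.0%}")
print(f"weights at 4 bits:  ~{fp4_gb:.0f} GB")
print(f"weights at 16 bits: ~{bf16_gb:.0f} GB")
```

Under these assumptions, about 10% of the parameters are active per token, and the full set of weights occupies roughly 275 GB at 4 bits versus 1,100 GB at 16 bits. All 550B parameters still have to be resident in memory, even though only 55B are used for any given token.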

Benchmark highlights:

SWE-bench Verified: 65–70.4% (framework-independent consistency)
Agent Productivity (PinchBench): 91% (tied for top)
Instruction Following (IFBench): 82%
Long Context (Ruler @ 1M tokens): 95%
Long-horizon Planning (EnterpriseOps-Gym): 33%
Up to 30% cost savings on task completion vs. comparably-sized alternatives

Availability:

Weights on Hugging Face
build.nvidia.com and NVIDIA NIM microservice
Perplexity (Pro), OpenRouter, Anaconda
Cloud: AWS, Google Cloud, Azure, Oracle
Inference: SGLang, TRT-LLM, vLLM

The model competes directly with GLM-5.1 (744B), Kimi K2.6 (1T), and Qwen 3.5 (397B), delivering superior inference speed with comparable reasoning accuracy per NVIDIA's benchmarks.

Context

NVIDIA first previewed the Nemotron 3 family in December 2025, with Ultra listed as "expected H1 2026." The June 4 launch delivers on that timeline. The Nemotron 3 Nano (30B, 3B active) and Super (~100B, 10B active) are the smaller family members; Nano has been available since December while Super and Ultra ship now.

The Hybrid Mamba-Transformer design is notable: Mamba is a state-space model architecture that handles long sequences more efficiently than standard attention, and integrating it with Transformer layers is an active research direction. Coupling this with MoE for compute efficiency and NVFP4 for Blackwell hardware optimization reflects NVIDIA's intent to tie its open AI model strategy directly to its newest GPU generation.
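
The efficiency argument for state-space layers can be illustrated with a toy example. The sketch below is a generic diagonal linear state-space recurrence, not Mamba itself (Mamba adds input-dependent, "selective" parameters and a hardware-aware scan) and not NVIDIA's implementation. All function names and parameter values are hypothetical. The point it shows is that the recurrence carries a fixed-size state from token to token, while full self-attention compares every token with every other token.

```python
def ssm_scan(xs, a, b, c):
    """Toy diagonal linear SSM over a scalar input sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state)
    y_t = sum(c * h_t)
    The state h has len(a) entries regardless of sequence length.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


def attention_score_entries(seq_len: int) -> int:
    """Pairwise scores computed by full self-attention (per head)."""
    return seq_len * seq_len


def ssm_state_entries(state_dim: int) -> int:
    """Entries carried between steps by the recurrence above."""
    return state_dim


# An impulse decays geometrically through a one-dimensional state.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])
print(impulse_response)

# At a 1M-token context, pairwise attention scores number 10**12 per head,
# while the recurrent state stays at whatever size was chosen (16 here,
# an arbitrary illustrative value).
print(attention_score_entries(1_000_000), ssm_state_entries(16))
```

The work done by the scan grows linearly with sequence length, which is why hybrid designs use such layers for most of the sequence mixing and reserve attention for a smaller number of layers.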

Why it matters

Nemotron 3 Ultra is the most capable open-weights model NVIDIA has released, and its 5x throughput claim over comparable open models — if validated by independent benchmarks — is a meaningful differentiator for enterprises running inference at scale on Blackwell hardware.

The OpenMDW-1.1 license is commercially permissive, making the model viable for production deployments without royalty concerns. Availability across AWS, GCP, Azure, and Oracle from day one covers the major enterprise cloud footprints.

For the open-source ecosystem, a 550B-parameter model with a 1M-token context window and strong SWE-bench scores competes squarely with closed frontier models. The question is whether the active 55B-parameter compute profile allows deployment at costs that make it viable for most teams — NVIDIA's cost savings claim implies it does.
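
A rough way to frame that question is to estimate per-token compute from the active parameter count. The sketch below uses the common rule of thumb of about 2 floating-point operations per parameter per generated token for the forward pass. That rule is an approximation: it ignores attention and state-space sequence-mixing costs, routing overhead, and memory-bandwidth limits, so it says nothing definitive about real-world cost. The dense 550B model is a hypothetical comparison point, not a real product.

```python
# Rule-of-thumb forward-pass compute per token: ~2 FLOPs per parameter used
# (one multiply and one add per weight). This is an approximation, not a
# measurement of Nemotron 3 Ultra.
ACTIVE_PARAMS = 55e9    # parameters active per token (from the article)
TOTAL_PARAMS = 550e9    # total parameters (from the article)
FLOPS_PER_PARAM = 2


def forward_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass FLOPs for one token."""
    return FLOPS_PER_PARAM * n_params


moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
# Hypothetical dense model that uses all 550B parameters on every token.
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
ratio = dense_flops / moe_flops

print(f"MoE (55B active): ~{moe_flops:.2e} FLOPs/token")
print(f"Dense 550B (hypothetical): ~{dense_flops:.2e} FLOPs/token")
print(f"compute ratio: ~{ratio:.0f}x")
```

By this estimate the per-token compute resembles that of a 55B dense model, while the memory footprint is still set by the full 550B parameters. Whether that trade-off is affordable depends on how much accelerator memory a team can provision.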

Corroborating sources

Developer.nvidia
https://developer.nvidia.com/blog/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running-agents/
“5x higher throughput compared to other open models in its class”
