Google DeepMind's DiffusionGemma arrives with NVIDIA day-zero RTX support, generating text up to 4x faster than autoregressive models
Google DeepMind has released DiffusionGemma, an experimental open-weights language model that generates text using a diffusion process rather than token-by-token prediction, achieving speeds up to four times faster than comparable autoregressive models on identical hardware. NVIDIA announced same-day optimization across its RTX, RTX PRO, and DGX Spark lineup, bringing the model to desktop and workstation users without cloud dependency.
What's new
DiffusionGemma replaces sequential token prediction with block-level denoising. Where standard autoregressive models generate one token at a time—a fundamental throughput bottleneck—DiffusionGemma denoises entire token blocks simultaneously, processing up to 256 tokens per step. On a single NVIDIA H100 GPU, that translates to 1,000 tokens per second in single-user scenarios, roughly four times the throughput of equivalent autoregressive models.
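A rough count shows why denoising a block at a time can cut the number of sequential model calls. The sketch below is illustrative only: the function names are invented for this example, and the figure of 64 denoising steps per block is an assumption, not a number published for DiffusionGemma. Only the 256-token block size comes from the announcement.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one sequential forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int = 256,
                           steps_per_block: int = 1) -> int:
    """Each denoising step is one pass over a whole block of tokens.

    steps_per_block is a hypothetical parameter: the source does not say
    how many denoising steps DiffusionGemma spends on a block.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


tokens = 1024
ar = autoregressive_passes(tokens)
# 64 steps per block is an assumed value chosen purely for illustration.
diff = block_diffusion_passes(tokens, block_size=256, steps_per_block=64)
print(f"autoregressive passes: {ar}")
print(f"block-diffusion passes: {diff}")
print(f"ratio of sequential passes: {ar / diff:.1f}")
```

Under these assumptions, 1,024 tokens take 1,024 sequential passes autoregressively and 256 with block denoising. The real speedup depends on how many denoising steps a block needs and on the cost of each pass, neither of which is given in the announcement.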
NVIDIA's RTX AI Garage team shipped optimized support on the day of release across three hardware tiers:
- GeForce RTX GPUs: consumer desktops and laptops
- RTX PRO platforms: workstation-class hardware
- DGX Spark: compact desktop AI system
The model is released under the Apache 2.0 license, which permits commercial use without per-token fees, subject to the license's attribution and notice terms. Day-zero framework support is available in Hugging Face Transformers, vLLM, and Unsloth, three widely used open-source tools for running and fine-tuning models.
Context
DiffusionGemma extends Google DeepMind's Gemma family of open models, the openly released counterparts to the proprietary Gemini series. Earlier family members include Gemma 2, Gemma 3, and PaliGemma for vision tasks.
The diffusion architecture itself is borrowed from image generation: Stable Diffusion and similar models start with random noise and iteratively refine it toward a coherent image. Applying that approach to discrete text has been a longstanding research challenge because text lacks the continuous pixel values that make image diffusion straightforward. DiffusionGemma represents one of the first open-weights models to bring a working diffusion approach to text generation with commercial licensing.
The sequential dependency in autoregressive models is an algorithmic constraint, not a hardware limitation: each token depends on every token before it, so the model cannot generate several positions of a sequence in parallel no matter how many cores are available. Diffusion breaks that constraint by treating the output as a whole to be refined rather than a sequence to be grown.
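The contrast between growing a sequence and refining a whole block can be shown with a toy loop. This is a minimal sketch in the spirit of masked discrete diffusion, not DiffusionGemma's actual algorithm: `toy_denoiser`, the fixed `TARGET` sentence, and the word-length "confidence" score are all hypothetical stand-ins for a real model.

```python
MASK = "_"
# Toy "answer" the stand-in model always proposes; a real model predicts tokens.
TARGET = "diffusion refines every position of the block at once".split()


def toy_denoiser(block):
    """Hypothetical stand-in for ONE model call: (token, confidence) for every position."""
    # Confidence is an arbitrary deterministic score (word length), not a probability.
    return [(TARGET[i], len(TARGET[i])) for i in range(len(block))]


def diffusion_decode(length, steps):
    """Start fully masked; each step commits the most confident proposals in parallel."""
    block = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    calls = 0
    for _ in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(block)  # one call covers the whole block
        calls += 1
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = proposals[i][0]
    return block, calls


def autoregressive_decode(length):
    """Grow the sequence left to right; one stand-in model call per token."""
    out, calls = [], 0
    for i in range(length):
        out.append(TARGET[i])  # a real model would condition on `out` here
        calls += 1
    return out, calls


diff_tokens, diff_calls = diffusion_decode(len(TARGET), steps=3)
ar_tokens, ar_calls = autoregressive_decode(len(TARGET))
print(" ".join(diff_tokens), "| model calls:", diff_calls)
print(" ".join(ar_tokens), "| model calls:", ar_calls)
```

Both loops end with the same nine-word sentence, but the refinement loop reaches it in three stand-in model calls and the left-to-right loop in nine. In a real system each refinement call is heavier, so the call count alone does not determine wall-clock speed.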
Why it matters
The 4x throughput improvement matters most in constrained deployment scenarios: on-device coding assistants, offline document processing, and edge deployments in regulated industries where data cannot leave local hardware. For these use cases, a comparable speedup on local RTX hardware would change what's practical, though the 1,000 tokens-per-second figure cited above was measured on a data-center H100, not a consumer GPU.
NVIDIA's decision to ship same-day optimization suggests the company sees diffusion-based LLMs as a viable alternative architecture. Day-one support in Hugging Face Transformers, vLLM, and Unsloth means developers don't need to adopt new tooling to test the architecture.
The Apache 2.0 license removes the usage restrictions that apply to Google's Gemini commercial APIs, making DiffusionGemma accessible to enterprise developers with permissive-license requirements. Whether diffusion LMs can match frontier autoregressive models on complex reasoning at production scale remains to be demonstrated, but the throughput numbers establish the architecture as worth tracking.
Corroborating sources
- NVIDIA Blog
https://blogs.nvidia.com/blog/rtx-ai-garage-local-gemma-diffusion/
“DiffusionGemma denoises up to 256 tokens per step instead of predicting one at a time.”