Google releases DiffusionGemma, a 26B MoE open model that generates text via parallel diffusion at over 1,100 tokens per second on an H100
On June 10, 2026, Google released DiffusionGemma, an experimental open-weights language model that generates text through a mechanism fundamentally different from that of standard autoregressive models. While conventional models produce one token at a time from left to right, DiffusionGemma generates entire blocks of tokens simultaneously by iteratively denoising them, a process adapted from image diffusion models and applied to discrete text.
What's new
DiffusionGemma is built on the 26B A4B Mixture-of-Experts architecture from Gemma 4, with 3.8 billion active parameters during inference and 128 total experts (8 active per step). The model generates text by:
- Accepting the full prompt context through an autoregressive encoder that caches it efficiently
- Initializing a 256-token output canvas filled with noise tokens
- Running multiple diffusion denoising passes over the canvas with bidirectional attention, so that every token attends to all others simultaneously
- Converging the canvas on coherent output
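The steps above can be illustrated with a toy loop. This is a minimal sketch, not DiffusionGemma's actual sampler: it assumes a masked-diffusion style schedule in which the most confident proposals are committed first, a common approach in discrete diffusion that the source does not confirm for this model. The names `MASK`, `toy_denoiser`, and `denoise_block` are hypothetical, and the "model" simply looks up a fixed target so the example runs without any weights.

```python
import random

MASK = "<mask>"  # hypothetical noise token, not DiffusionGemma's real vocabulary


def toy_denoiser(canvas, target, rng):
    """Stand-in for the model: propose a token and a confidence per position.

    A real denoiser would be a network with bidirectional attention over the
    whole canvas; this toy looks up a fixed target and invents confidences.
    """
    return [(target[i], rng.random()) for i in range(len(canvas))]


def denoise_block(target, num_steps=8, seed=0):
    """Fill a fully masked canvas in at most num_steps parallel passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        # One pass proposes tokens for every position at once.
        proposals = toy_denoiser(canvas, target, rng)
        # Commit an even share of the remaining masked positions,
        # taking the most confident proposals first.
        remaining_steps = num_steps - step
        quota = -(-len(masked) // remaining_steps)  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            canvas[i] = proposals[i][0]
    return canvas


target = "the quick brown fox jumps over the lazy dog".split()
result = denoise_block(target, num_steps=4)
print(result)
```

The point of the sketch is the cost model: the number of model passes is set by `num_steps`, not by the number of tokens in the block, which is where the parallel speedup comes from.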
This block-parallel approach reaches generation speeds that conventional sequential models cannot match at the same scale: Google reports per-user speeds exceeding 1,100 tokens per second in low-batch settings on an H100 with FP8 quantization.
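To put the reported figure in perspective, the back-of-the-envelope calculation below converts it into a time budget per 256-token block and per denoising pass. The number of passes per block is not stated in the source, so the values passed to the hypothetical `pass_budget_ms` helper are illustrative assumptions, and the estimate ignores prompt encoding time.

```python
CANVAS_TOKENS = 256        # block size given in the article
TOKENS_PER_SECOND = 1100   # reported lower bound (low batch, H100, FP8)

# Time available to finish one full block at the reported rate.
seconds_per_block = CANVAS_TOKENS / TOKENS_PER_SECOND


def pass_budget_ms(num_passes):
    """Milliseconds available per denoising pass for an assumed pass count."""
    return 1000 * seconds_per_block / num_passes


print(f"per block: {1000 * seconds_per_block:.1f} ms")
for passes in (8, 16, 32):  # assumed pass counts, not from the source
    print(f"{passes} passes: {pass_budget_ms(passes):.2f} ms per pass")
```

By comparison, a model that emits one token per forward pass would need more than 1,100 passes per second to reach the same per-user rate.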
Additional capabilities include:
- Multimodal input: natively accepts text, images (variable aspect ratio and resolution), and video; audio is not supported
- Long context: context window of up to 256K tokens
- Built-in reasoning: configurable thinking mode for step-by-step reasoning before final output
- Vision tasks: document/PDF parsing, chart comprehension, OCR, UI understanding
- VRAM efficiency: fits within 18 GB VRAM when quantized, enabling local consumer GPU deployment
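A rough weight-memory estimate shows why quantization matters for the 18 GB figure. This sketch assumes that "26B" means 26 billion total parameters, that all experts stay resident in VRAM (no offloading), and that 1 GB is 10^9 bytes; the source does not state which bit width is used, and `weight_gb` is a hypothetical helper. KV cache, activations, and runtime overhead are not counted.

```python
TOTAL_PARAMS = 26e9  # assumed total parameter count implied by "26B"


def weight_gb(bits_per_param, params=TOTAL_PARAMS):
    """Approximate weight storage in GB (10^9 bytes) at a given bit width."""
    return params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_gb(bits):.1f} GB")
```

Under these assumptions only the 4-bit case lands below 18 GB, with the remainder available for cache and activations. Note that the 3.8 billion active parameters reduce compute per token, not the memory needed to hold the full set of experts.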
DiffusionGemma is released under an Apache 2.0 license with day-zero support in vLLM, Hugging Face Transformers, MLX, and SGLang.
Context
Diffusion text models have been a research direction for several years but have struggled to match autoregressive models on open-ended generation quality. DiffusionGemma is notable for being the first diffusion LLM from a major lab at the 26B scale with a full MoE architecture, competitive multimodal capabilities, and an open license.
The model arrives alongside the broader Gemma 4 family and other Google open-weight releases aimed at the developer community. Labeling it experimental signals that Google is publishing it to accelerate community research rather than positioning it as production-ready.
Why it matters
For the AI research community, a trained open-weights diffusion LLM at 26B scale from Google is a significant artifact: prior public diffusion text models were substantially smaller or lower quality. Researchers can now study the scaling properties of diffusion text generation and probe where throughput gains trade off against generation quality.
For practitioners, 1,100 tokens per second on a single H100 at this model scale is exceptional. If generation quality holds for structured tasks (document analysis, reasoning, OCR), DiffusionGemma could justify the architecture shift for latency-critical workloads. The consumer GPU compatibility extends this to teams without data-center access.
Corroborating sources
- Huggingface.co
https://huggingface.co/google/diffusiongemma-26B-A4B-it
“Rather than generating one token at a time, the model iteratively denoises a full block of tokens using a diffusion sampler.”