Google releases DiffusionGemma, a 26B MoE open model generating text via parallel diffusion at over 1100 tokens per second on H100

Jun 10, 2026 (published Jun 19, 2026)

Google on June 10, 2026 released DiffusionGemma, an experimental open-weights language model that generates text through a fundamentally different mechanism from standard autoregressive models. While conventional models produce one token at a time from left to right, DiffusionGemma generates entire blocks of tokens simultaneously by iteratively denoising them, a process adapted from image diffusion models and applied to discrete text.

What's new

DiffusionGemma is built on the 26B A4B Mixture-of-Experts architecture from Gemma 4, with 3.8 billion active parameters during inference and 128 total experts (8 active per step). The model generates text by:

Accepting the full prompt context through an autoregressive encoder that caches it efficiently
Starting a 256-token output canvas filled with noise tokens
Running multiple diffusion denoising passes over the canvas with bidirectional attention — every token attending to all others simultaneously
Converging the canvas on coherent output
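
The four steps above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not DiffusionGemma's actual sampler: the `MASK` placeholder, the `toy_denoiser` stand-in, and the confidence-based commit schedule are all hypothetical, and a real model would compute its proposals with bidirectional attention over the canvas rather than reading them from a known target.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a noise token


def toy_denoiser(canvas, target, rng):
    """Stand-in for the model: propose (token, confidence) for each masked slot.

    Assumption: a real denoiser would predict tokens from the prompt and the
    whole canvas at once. Here the "prediction" is copied from a fixed target
    and the confidence is random, purely to show the control flow.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def denoise(target, steps=4, seed=0):
    """Fill a fully masked canvas over a fixed number of parallel passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)          # start from an all-noise canvas
    for step in range(steps):
        proposals = toy_denoiser(canvas, target, rng)
        if not proposals:
            break                          # nothing left to resolve
        remaining_passes = steps - step
        # Commit an even share of the unresolved slots (ceiling division),
        # so the final pass always resolves whatever is left.
        k = -(-len(proposals) // remaining_passes)
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            canvas[i] = proposals[i][0]
    return canvas


if __name__ == "__main__":
    print(denoise("the canvas converges on coherent output".split()))
```

The point of the sketch is the shape of the computation: the number of passes is fixed by `steps`, not by the number of tokens, which is where the parallel speedup over one-token-at-a-time decoding would come from.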

This block-parallel approach delivers generation speeds that conventional sequential models cannot match at the same scale: Google reports per-user generation speeds exceeding 1,100 tokens per second in low-batch settings on an H100 with FP8 quantization.
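
To put the reported figure in concrete terms, the arithmetic below converts it into wall-clock time. It is a back-of-the-envelope sketch that assumes the 1,100 tokens per second rate is sustained across the whole response; the 2,000-token response length is a hypothetical example, not a number from Google.

```python
CANVAS_TOKENS = 256          # canvas size described in the article
TOKENS_PER_SECOND = 1100     # reported lower bound (H100, FP8, low batch)


def generation_seconds(n_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Wall-clock time to emit n_tokens, assuming a constant sustained rate."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    # One full canvas: 256 / 1100 is roughly 0.23 seconds.
    print(f"one canvas:   {generation_seconds(CANVAS_TOKENS):.2f} s")
    # Hypothetical 2,000-token response: roughly 1.82 seconds.
    print(f"2,000 tokens: {generation_seconds(2000):.2f} s")
```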

Additional capabilities include:

Multimodal input: natively accepts text, images (variable aspect ratio and resolution), and video; audio is not supported
Long context: up to 256K token context window
Built-in reasoning: configurable thinking mode for step-by-step reasoning before final output
Vision tasks: document/PDF parsing, chart comprehension, OCR, UI understanding
VRAM efficiency: fits within 18 GB VRAM when quantized, enabling local consumer GPU deployment
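
The 18 GB figure can be sanity-checked with a rough weight-memory estimate. The article does not state which quantization level it refers to, so the 4-bit case below is an assumption. The estimate also counts weights only and ignores the KV cache, activations, and runtime overhead. All 26B parameters are counted because MoE inference typically keeps every expert resident in memory, even though only about 3.8B are active per step.

```python
TOTAL_PARAMS = 26e9      # total parameters, from the model name (26B)
VRAM_BUDGET_GB = 18      # figure quoted in the article


def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):   # assumed precisions, for comparison only
        gb = weight_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if gb <= VRAM_BUDGET_GB else "does not fit"
        print(f"{bits:>2}-bit weights: {gb:5.1f} GB ({verdict} in {VRAM_BUDGET_GB} GB)")
```

Under these assumptions only the 4-bit case (about 13 GB of weights) lands under the quoted budget, leaving roughly 5 GB for everything else.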

DiffusionGemma is released under an Apache 2.0 license with day-zero support in vLLM, Hugging Face Transformers, MLX, and SGLang.

Context

Diffusion text models have been a research direction for several years but have struggled to match autoregressive models on open-ended generation quality. DiffusionGemma is notable for being the first diffusion LLM from a major lab at the 26B scale with full MoE architecture, competitive multimodal capabilities, and an open license.

The model arrives alongside the broader Gemma 4 family and other Google open-weight releases aimed at the developer community. Releasing it as experimental signals Google is publishing to accelerate community research rather than positioning it as production-ready.

Why it matters

For the AI research community, a trained open-weights diffusion LLM at 26B scale from Google is a significant artifact — prior public diffusion text models were substantially smaller or lower quality. Researchers can now study scaling properties of diffusion text generation and probe where throughput gains trade against generation quality.

For practitioners, 1,100 tokens per second on a single H100 at this model scale is exceptional. If generation quality holds for structured tasks (document analysis, reasoning, OCR), DiffusionGemma could justify the architecture shift for latency-critical workloads. The consumer GPU compatibility extends this to teams without data-center access.

Corroborating sources

Huggingface.co
https://huggingface.co/google/diffusiongemma-26B-A4B-it
“Rather than generating one token at a time, the model iteratively denoises a full block of tokens using a diffusion sampler.”
