DiffusionGemma Open-Sourced: Google Applies Image Diffusion to Language Models for 4x Faster Inference

Nils Liu
Google, DiffusionGemma, Open-Source Models, Text Diffusion, AI Models, News

TL;DR

Google DeepMind open-sourced DiffusionGemma 26B-A4B on June 10, 2026, applying image diffusion techniques to text generation: 15–20 tokens per forward pass, 1,000+ tokens/sec on an H100, and a claimed 4× speedup over comparable autoregressive models. The tradeoff: lower output quality than standard Gemma 4.

Google DeepMind uploaded DiffusionGemma 26B-A4B to Hugging Face on June 10, 2026, under an Apache 2.0 license. Anyone can download, modify, and build on it commercially. In raw capability rankings, this model is not at the top. The architecture is what makes it worth examining: it brings image diffusion logic to text generation, replacing the familiar token-by-token autoregressive loop with block-parallel updates.

Speed Numbers

Each forward pass produces 15 to 20 tokens simultaneously. A standard autoregressive model produces one. On an NVIDIA H100, that translates to 1,000+ tokens per second. On a consumer GeForce RTX 5090, the number lands at 700+ tokens per second. Google’s official claim is a 4× speed advantage over autoregressive models of similar scale.
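
To put the figures above in perspective, here is a back-of-the-envelope sketch in Python. It counts the forward passes needed for one response and the pass rate implied by the quoted throughput. The 1,000-token response length and the function names are illustrative assumptions; only the 15 to 20 tokens per pass and the 1,000+ tokens/sec figure come from the numbers above.

```python
import math

def forward_passes(num_tokens: int, tokens_per_pass: int) -> int:
    """Forward passes needed to emit num_tokens at a fixed tokens-per-pass rate."""
    return math.ceil(num_tokens / tokens_per_pass)

def implied_passes_per_second(tokens_per_second: float, tokens_per_pass: int) -> float:
    """Pass rate implied by a throughput figure and a tokens-per-pass figure."""
    return tokens_per_second / tokens_per_pass

RESPONSE_TOKENS = 1_000  # hypothetical response length, not from the article

print(forward_passes(RESPONSE_TOKENS, 1))   # autoregressive: 1000 passes
print(forward_passes(RESPONSE_TOKENS, 15))  # diffusion, low end: 67 passes
print(forward_passes(RESPONSE_TOKENS, 20))  # diffusion, high end: 50 passes

# At 1,000 tokens/sec on an H100, 20 tokens per pass implies 50 passes/sec.
print(implied_passes_per_second(1_000, 20))  # 50.0
```

The pass count drops by 15 to 20×, while the claimed end-to-end speedup is 4×. One plausible reading is that each diffusion pass costs more than a single autoregressive decoding step, though the published figures do not break this down.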

The architecture is a 26B mixture-of-experts model that activates only about 3.8B parameters per token, selecting 8 active experts from a pool of 128. The sparse activation keeps per-token compute close to that of a much smaller dense model, and the 4-bit NVFP4 numerical format keeps the memory footprint well below what the raw parameter count would suggest.
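
A minimal sketch of what "8 of 128 experts" and a 4-bit format mean in practice, in plain Python. The routing rule shown here (take the top 8 router scores, then softmax over just those 8) is a common mixture-of-experts pattern and an assumption, not a confirmed detail of DiffusionGemma. The memory estimate counts weights only and treats NVFP4 as exactly 4 bits per parameter, ignoring scale factors, activations, and cache.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article

def route_top_k(router_logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Assumed routing rule for illustration only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    max_logit = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weights-only memory in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
selected = route_top_k(logits)

print(len(selected))                # 8 experts chosen for this token
print(weight_memory_gb(26e9, 16))   # 52.0 GB at 16 bits per weight
print(weight_memory_gb(26e9, 4))    # 13.0 GB at 4 bits per weight
print(round(3.8 / 26, 3))           # 0.146: share of parameters active per token
```

Note the two effects are separate: all 26B weights still have to be stored, so the 4-bit format is what shrinks memory, while the sparse routing is what shrinks the compute spent on each token.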

How Text Diffusion Works

Image diffusion starts with noise and progressively denoises toward a target. DiffusionGemma applies the same logic to language: it fills a response frame with random tokens, then iteratively replaces uncertain positions with contextually appropriate words. Every iteration, the entire block updates together.
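
The loop described above can be illustrated with a toy in Python. This is not DiffusionGemma's sampler: `toy_denoiser` is a hypothetical stand-in that already knows the target sentence and attaches a random confidence to each position, and the rule of committing the most confident open positions each step is an assumed schedule chosen for clarity. What the sketch does show is the shape of the process: start from random tokens, score the whole block at once, keep what looks certain, and re-noise the rest.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "a"]
TARGET = ["the", "cat", "sat", "on", "the", "mat"]  # what the toy "model" converges to

def toy_denoiser(block, rng):
    """Hypothetical model stand-in: one (token, confidence) proposal per position.

    A real model would condition on the current block; this toy ignores it.
    """
    return [(TARGET[pos], rng.random()) for pos in range(len(block))]

def generate(block_len, steps, rng):
    """Refine a block of random tokens over a fixed number of steps."""
    block = [rng.choice(VOCAB) for _ in range(block_len)]  # start from noise
    resolved = [False] * block_len
    per_step = math.ceil(block_len / steps)  # positions committed per step
    history = [list(block)]

    for _ in range(steps):
        proposals = toy_denoiser(block, rng)  # whole block scored in one pass
        open_positions = [p for p in range(block_len) if not resolved[p]]
        open_positions.sort(key=lambda p: proposals[p][1], reverse=True)

        for p in open_positions[:per_step]:   # commit the most confident
            block[p] = proposals[p][0]
            resolved[p] = True
        for p in open_positions[per_step:]:   # uncertain positions stay noisy
            block[p] = rng.choice(VOCAB)
        history.append(list(block))

    return block, history

final, history = generate(len(TARGET), steps=3, rng=random.Random(0))
for snapshot in history:
    print(" ".join(snapshot))
```

With six positions and three steps, two positions are committed per step, so the last snapshot printed is the target sentence; the intermediate snapshots depend on the random seed.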

Standard autoregressive generation has a structural bottleneck: each token is conditioned on every token before it, so the tokens of a single response have to be produced one after another. Block-parallel diffusion sidesteps that dependency by treating the entire generation window as a single refinement target. DiffusionGemma supports context windows up to 256K tokens.

The Tradeoff and History

Speed comes at a cost. Output quality runs below standard Gemma 4, particularly on tasks requiring precise reasoning. Google positions this as an experimental model, and that framing is accurate.

The research lineage goes back to May 2025, when Google released an experimental Gemini Diffusion model that was tested but never publicly shipped. DiffusionGemma rebuilds that work on the Gemma 4 26B-A4B architecture and releases it as open weights. NVIDIA’s NIM platform also hosts the model for free inference.

The open-source decision is notable. A model with a clear throughput advantage over existing alternatives is now in the hands of the broader research community under a permissive license. Text diffusion was a niche research direction in 2025. With concrete benchmark numbers and a downloadable model, the conversation shifts.

Beyond text, the model accepts image and video inputs and covers 35+ languages.
