DiffusionGemma Open-Sourced: Google Applies Image Diffusion to Language Models for 4x Faster Inference
TL;DR
Google DeepMind open-sourced DiffusionGemma 26B-A4B on June 10, 2026, applying image diffusion techniques to text generation: 15–20 tokens per forward pass, 1,000+ tokens per second on an H100, and a claimed 4× speedup over comparable autoregressive models. The tradeoff: lower output quality than standard Gemma 4.
Google DeepMind uploaded DiffusionGemma 26B-A4B to Hugging Face on June 10, 2026, under an Apache 2.0 license. Anyone can download, modify, and build on it commercially. In raw capability rankings, this model is not at the top. The architecture is what makes it worth examining: it brings image diffusion logic to text generation, replacing the familiar token-by-token autoregressive loop with block-parallel updates.
Speed Numbers
Each forward pass produces 15 to 20 tokens simultaneously. A standard autoregressive model produces one. On an NVIDIA H100, that translates to 1,000+ tokens per second. On a consumer GeForce RTX 5090, the number lands at 700+ tokens per second. Google’s official claim is a 4× speed advantage over autoregressive models of similar scale.
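The figures above can be related with some quick arithmetic. The sketch below uses only the numbers quoted in this article; the 600-token response length is a hypothetical example, and the per-pass cost ratio is an inference that assumes the 4× claim compares models on the same hardware, which the article does not spell out.

```python
import math

def forward_passes_needed(num_tokens: int, tokens_per_pass: int) -> int:
    """Sequential forward passes required to emit num_tokens."""
    return math.ceil(num_tokens / tokens_per_pass)

def implied_passes_per_second(tokens_per_second: float, tokens_per_pass: int) -> float:
    """Forward passes per second implied by a throughput figure."""
    return tokens_per_second / tokens_per_pass

def implied_pass_cost_ratio(tokens_per_pass: int, claimed_speedup: float) -> float:
    """How much longer one block pass would take than one autoregressive
    pass if the whole speedup came from tokens per pass (an assumption)."""
    return tokens_per_pass / claimed_speedup

RESPONSE_LEN = 600  # hypothetical response length, not from the article

for tokens_per_pass in (1, 15, 20):  # 1 = standard autoregressive decoding
    passes = forward_passes_needed(RESPONSE_LEN, tokens_per_pass)
    print(f"{tokens_per_pass:>2} tokens/pass -> {passes} sequential passes")

for tokens_per_pass in (15, 20):
    rate = implied_passes_per_second(1000, tokens_per_pass)  # H100 figure
    ratio = implied_pass_cost_ratio(tokens_per_pass, 4.0)    # claimed 4x
    print(f"{tokens_per_pass} tokens/pass: ~{rate:.0f} passes/sec, "
          f"each pass ~{ratio:.2f}x the cost of an autoregressive pass")
```

Read this way, the gap between 15–20 tokens per pass and a 4× overall speedup suggests that a block pass costs more than a single autoregressive step, or that the same positions are refined across several passes, or both.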
The architecture is a 26B-parameter mixture-of-experts model that activates only 3.8B parameters per token, selecting 8 active experts from a pool of 128. Combined with NVIDIA's 4-bit NVFP4 floating-point format, the memory footprint stays well below what the raw parameter count would suggest.
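A back-of-the-envelope calculation shows why the footprint is smaller than "26B" suggests. This is a rough sketch of weight storage only: it ignores activations and the KV cache, and the 4.5 bits-per-parameter figure is an assumption meant to approximate 4-bit values plus per-block scale factors, not a number from Google or NVIDIA.

```python
TOTAL_PARAMS = 26e9     # from the article
ACTIVE_PARAMS = 3.8e9   # from the article
NUM_EXPERTS = 128       # from the article
ACTIVE_EXPERTS = 8      # from the article

def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

# Share of all parameters that participate in computing one token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
# Share of experts selected per token.
expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(f"active parameters: {active_fraction:.1%} of the total")
print(f"active experts:    {expert_fraction:.2%} of the pool")

# 16 bits = a common half-precision baseline; 4.5 bits = assumed
# effective size of a 4-bit format once scale factors are included.
for label, bits in (("16-bit", 16), ("4-bit, no overhead", 4), ("4-bit + scales (assumed)", 4.5)):
    print(f"{label:<26} ~{weight_gigabytes(TOTAL_PARAMS, bits):.1f} GB")
```

The active-parameter share comes out higher than the active-expert share, presumably because attention layers, embeddings, and other shared weights run for every token regardless of routing.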
How Text Diffusion Works
Image diffusion starts with noise and progressively denoises toward a target. DiffusionGemma applies the same logic to language: it fills a response frame with random tokens, then iteratively replaces uncertain positions with contextually appropriate words. Every iteration, the entire block updates together.
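The refinement loop described above can be illustrated with a toy example. This is a minimal sketch, not DiffusionGemma's actual algorithm: `toy_predict` is a hypothetical stand-in for a model forward pass, the vocabulary and target sentence are made up, and committing the most confident positions first is one common scheduling choice in discrete diffusion work that the article does not confirm for this model.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "a"]  # made-up vocabulary
TARGET = ["the", "cat", "sat", "on", "the", "mat"]  # what the toy "model" believes

def toy_predict(block, rng):
    """Stand-in for one forward pass: proposes a (token, confidence)
    pair for every position in the block at once."""
    return [(TARGET[i], rng.random()) for i in range(len(block))]

def denoise(block_len, steps, rng):
    """Start from random tokens, then commit the most confident
    proposals a few positions at a time until the block is settled."""
    block = [rng.choice(VOCAB) for _ in range(block_len)]
    fixed = [False] * block_len
    per_step = -(-block_len // steps)  # ceiling division
    history = [list(block)]
    for _ in range(steps):
        open_positions = [i for i in range(block_len) if not fixed[i]]
        if not open_positions:
            break
        preds = toy_predict(block, rng)  # one pass scores the whole block
        open_positions.sort(key=lambda i: preds[i][1], reverse=True)
        for i in open_positions[:per_step]:
            block[i] = preds[i][0]
            fixed[i] = True
        history.append(list(block))
    return block, history

final, history = denoise(block_len=len(TARGET), steps=3, rng=random.Random(0))
for step, snapshot in enumerate(history):
    print(step, " ".join(snapshot))
```

Each printed row is the whole block after one pass: the first row is noise, and every later row has more positions settled. Here six tokens are produced in three passes instead of six, which is the source of the throughput gain.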
Standard autoregressive generation has a structural bottleneck: each step depends on the previous one, so later tokens in a sequence cannot be computed in parallel with earlier ones. Block-parallel diffusion sidesteps the dependency by treating the entire generation window as a single refinement target.
The Tradeoff and History
Speed comes at a cost. Output quality runs below standard Gemma 4, particularly on tasks requiring precise reasoning. Google positions this as an experimental model, and that framing is accurate.
The research lineage goes back to May 2025, when Google released an experimental Gemini Diffusion model that was tested but never publicly shipped. DiffusionGemma rebuilds that work on the Gemma 4 26B-A4B architecture and releases it as open weights. NVIDIA’s NIM platform also hosts the model for free inference.
The open-source decision is notable. A model with a clear throughput advantage over existing alternatives is now in the hands of the broader research community under a permissive license. Text diffusion was a niche research direction in 2025. With concrete benchmark numbers and a downloadable model, the conversation shifts.
The model accepts text, image, and video inputs, covers 35+ languages, and supports context windows up to 256K tokens.