Google Unveils DiffusionGemma: A New Paradigm for Text Generation
Google has released DiffusionGemma, an experimental language model that generates text with a diffusion-based method, producing blocks of 256 tokens at once. This marks a significant departure from traditional autoregressive models, which generate one token at a time, each dependent on the ones before it. By processing tokens in parallel, the model makes better use of graphics processors, achieving speeds up to four times faster than traditional models when running in single-user mode on dedicated GPUs.
How DiffusionGemma Works
DiffusionGemma starts with a block of 256 random placeholder tokens and refines them across several passes until readable text emerges. This concept is borrowed from image AI, where diffusion models turn noise into clear images. The model has 26 billion parameters total but only activates 3.8 billion per step, thanks to a mixture-of-experts architecture where several specialized sub-networks sit side by side and only the right ones fire depending on the input. When quantized to lower precision, the model fits into 18 GB of VRAM on high-end consumer GPUs, according to Google. It builds on the Gemma 4 family and borrows its diffusion process from Google's earlier research on Gemini Diffusion.
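The refinement loop described above can be illustrated with a toy sketch. It is not DiffusionGemma's actual algorithm: the `toy_denoiser` function, the eight-token block, and the rule of committing the most confident positions on each pass are all assumptions made for illustration, and a fixed target sentence stands in for a trained network.

```python
import random

BLOCK_SIZE = 8  # the article's block is 256 tokens; 8 keeps the demo readable
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "today", "."]
# Stand-in for what a trained model would prefer (purely illustrative).
TARGET = ["the", "cat", "sat", "on", "a", "mat", "today", "."]


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the network.

    Returns one (predicted token, confidence) pair per position.
    A real model would compute these from the whole noisy block.
    """
    return [(TARGET[i], rng.random()) for i in range(len(block))]


def generate_block(num_passes=4, seed=0):
    rng = random.Random(seed)
    # Start from random placeholder tokens.
    block = [rng.choice(VOCAB) for _ in range(BLOCK_SIZE)]
    committed = [False] * BLOCK_SIZE
    per_pass = -(-BLOCK_SIZE // num_passes)  # ceiling division
    history = [list(block)]

    for _ in range(num_passes):
        preds = toy_denoiser(block, rng)
        # Assumed schedule: commit the most confident open positions first.
        open_positions = [i for i in range(BLOCK_SIZE) if not committed[i]]
        open_positions.sort(key=lambda i: preds[i][1], reverse=True)
        for i in open_positions[:per_pass]:
            block[i] = preds[i][0]
            committed[i] = True
        history.append(list(block))
    return block, history


final_block, history = generate_block()
for step, snapshot in enumerate(history):
    print(step, " ".join(snapshot))
```

Each printed line is the whole block after one pass, so the output shows every position being worked on at once rather than text growing from left to right.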
Performance and Speed Gains
Nvidia, which handled the optimization, explains that the speed advantage comes down to hardware usage. With autoregressive models, single-user inference is often bottlenecked by memory bandwidth. The GPU's compute units sit idle most of the time, waiting for data from memory, a condition engineers call memory-bound. DiffusionGemma sidesteps the problem by processing up to 256 tokens in parallel, pushing the bottleneck toward raw compute instead. The result is that GPUs actually stay busy. Nvidia reports about 1,000 tokens per second on an H100 when processing a single request, 150 tokens per second on the DGX Spark deskside system, and up to 80 tokens per second on consumer hardware.
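A back-of-the-envelope model makes the memory-bound versus compute-bound distinction concrete. The sketch below assumes a forward pass takes as long as the slower of two steps: streaming the active weights from memory, or doing roughly two floating-point operations per active parameter per token. The bandwidth, compute, precision, and pass-count figures are hypothetical placeholders, not measured hardware specifications or Nvidia's numbers; only the 3.8 billion active parameters and the 256-token block come from the reporting above.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    mem_bandwidth: float  # bytes per second (hypothetical value below)
    peak_compute: float   # floating-point operations per second (hypothetical)


def pass_time(hw, active_params, bytes_per_param, tokens_per_pass):
    """Return (seconds, limiting factor) for one forward pass.

    Simplification: weights read per pass equal the active parameters.
    In a mixture-of-experts model a large block may touch more experts
    than a single token does, which this sketch ignores.
    """
    memory_time = active_params * bytes_per_param / hw.mem_bandwidth
    compute_time = 2 * active_params * tokens_per_pass / hw.peak_compute
    if memory_time >= compute_time:
        return memory_time, "memory"
    return compute_time, "compute"


def tokens_per_second(hw, active_params, bytes_per_param, block_size, passes_per_block):
    """Throughput when each block needs `passes_per_block` forward passes."""
    seconds, bound = pass_time(hw, active_params, bytes_per_param, block_size)
    return block_size / (passes_per_block * seconds), bound


gpu = Hardware(mem_bandwidth=3e12, peak_compute=5e14)  # placeholder numbers
ACTIVE_PARAMS = 3.8e9  # from the article
BYTES_PER_PARAM = 2    # assumes 16-bit weights

# Autoregressive decoding: one token per pass, one pass per token.
ar_rate, ar_bound = tokens_per_second(gpu, ACTIVE_PARAMS, BYTES_PER_PARAM, 1, 1)
# Block decoding: 256 tokens per pass, with an assumed 16 refinement passes.
diff_rate, diff_bound = tokens_per_second(gpu, ACTIVE_PARAMS, BYTES_PER_PARAM, 256, 16)

print(f"autoregressive: {ar_rate:.0f} tok/s ({ar_bound}-bound)")
print(f"block of 256:   {diff_rate:.0f} tok/s ({diff_bound}-bound)")
```

With these placeholder numbers the single-token case is limited by memory traffic and the 256-token case by arithmetic. The absolute throughput figures it prints are not predictions.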
Trade-offs: Quality vs. Speed
While DiffusionGemma generates far more tokens per second than the autoregressive Gemma 4 models, it scores slightly lower on accuracy, and the quality of its generated text trails that of conventional models. The approach is nonetheless particularly well suited to non-linear tasks such as inserting text after the fact or filling in gaps in program code. This makes DiffusionGemma a promising tool for applications where speed and flexibility matter more than perfect fluency.
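Gap filling suits this style of generation because the known text on both sides of a gap can stay fixed while only the missing positions are refined. The following sketch shows the idea on a tiny code snippet. The `TOY_RULES` lookup is a hypothetical stand-in for a trained model, and nothing here reflects how DiffusionGemma itself performs infilling.

```python
MASK = "<mask>"

# Hypothetical lookup standing in for a trained denoiser. Keys are
# (left neighbour, right neighbour); a neighbour may itself still be MASK.
TOY_RULES = {
    ("=", MASK): "sum",
    ("sum", "items"): "(",
}


def infill(tokens, max_passes=8):
    """Fill MASK positions pass by pass, leaving known tokens untouched."""
    tokens = list(tokens)
    for _ in range(max_passes):
        updates = {}
        for i, tok in enumerate(tokens):
            if tok != MASK:
                continue  # text on either side of the gap is never rewritten
            left = tokens[i - 1] if i > 0 else None
            right = tokens[i + 1] if i + 1 < len(tokens) else None
            # Each guess may depend on what comes AFTER the gap as well.
            guess = TOY_RULES.get((left, right))
            if guess is not None:
                updates[i] = guess
        if not updates:
            break  # nothing left to fill, or no rule applies
        # Apply all guesses from this pass together.
        for i, guess in updates.items():
            tokens[i] = guess
    return tokens


snippet = ["total", "=", MASK, MASK, "items", ")"]
filled = infill(snippet)
print(" ".join(filled))
```

The second gap is resolved only after the first, and its guess depends on the token that follows it, something a strictly left-to-right generator cannot take into account directly.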
Implications for AI Development
DiffusionGemma represents a new direction in language model architecture, potentially enabling faster inference on existing hardware without requiring specialized accelerators. By making the model open-weight, Google invites the research community to explore and build upon this approach. The mixture-of-experts design also allows for efficient scaling, activating only a fraction of parameters per step, which could lead to more sustainable AI deployment. As the field continues to push for both performance and efficiency, DiffusionGemma offers a compelling alternative to autoregressive models for specific use cases.
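The mixture-of-experts idea, running only a few sub-networks for each input, can be sketched as top-k routing. This is a minimal illustration under stated assumptions: the four experts, the choice of k = 2, and the softmax router are hypothetical, since the reporting does not describe DiffusionGemma's expert count or routing scheme. Only the 26 billion and 3.8 billion parameter figures come from the article.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs."""
    weights = route_top_k(router_scores, k)
    output = sum(w * experts[i](x) for i, w in weights.items())
    return output, weights


def make_expert(scale):
    """Toy expert: multiplies its input by a fixed scale."""
    return lambda x: scale * x


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
# In a real model the router scores are computed from the input itself.
output, weights = moe_layer(10.0, experts, router_scores=[0.1, 2.0, -1.0, 1.5], k=2)
print("experts used:", sorted(weights), "output:", round(output, 2))

# Share of parameters active per step, using the article's figures.
active_fraction = 3.8 / 26
print(f"active share: {active_fraction:.1%}")
```

Only two of the four toy experts run for this input. By the same logic, the article's figures imply that roughly one seventh of DiffusionGemma's parameters are active on any given step.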
This article is based on reporting by The Decoder.
Originally published on the-decoder.com
