Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation

Google AI, together with Google DeepMind researchers, has released DiffusionGemma, an experimental open model for text generation. It uses text diffusion instead of standard autoregressive decoding. The model ships under the permissive Apache 2.0 license. Google positions it for developers and researchers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.

Most language models in use today are autoregressive. They generate one token at a time, left to right, and each new token depends on all the tokens before it. DiffusionGemma works differently. It generates entire blocks of text simultaneously, in parallel. On dedicated GPUs, Google reports up to 4x faster generation.

What is DiffusionGemma

DiffusionGemma is a 26B Mixture of Experts (MoE) model. It activates only 3.8B parameters during inference. It is built on the Gemma 4 backbone, specifically the 26B-A4B architecture. Google integrated a diffusion head onto that base.

The model is multimodal. It processes interleaved text, image, and video inputs. It generates text outputs from those inputs. The context window is 256K tokens, and it supports 140+ languages.

Quantized, the model fits within 18GB of VRAM. That places it inside high-end consumer GPU limits. On a single NVIDIA H100, it reaches 1000+ tokens per second. On an NVIDIA GeForce RTX 5090, it reaches 700+ tokens per second.
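As a rough sanity check on the 18GB figure, the weight footprint can be estimated from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch: the 4-bit width is an assumption for the estimate, the helper name is hypothetical, and the KV cache, activations, and quantization metadata are not modeled.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 26e9      # total parameters, from the article
ACTIVE_PARAMS = 3.8e9    # parameters activated per token, from the article
ASSUMED_BITS = 4         # assumption: 4-bit quantized weights

total_gb = weight_footprint_gb(TOTAL_PARAMS, ASSUMED_BITS)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights at {ASSUMED_BITS}-bit: {total_gb:.1f} GB")
print(f"active fraction per token: {active_fraction:.1%}")
```

Under that assumption the weights alone take about 13 GB, which would leave some headroom inside 18GB for the KV cache and activations. In a typical MoE deployment all experts stay resident in memory, so the 3.8B active figure lowers per-token work, not the storage requirement.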

Google is very direct about the trade-off. DiffusionGemma prioritizes speed and parallel layout generation. Its overall output quality is lower than standard Gemma 4. For maximum quality production work, Google still recommends autoregressive Gemma 4.

How Text Diffusion Works

Text diffusion borrows its core idea from AI image generators. Those models start with visual static and refine it iteratively. DiffusionGemma applies the same pattern to text generation.

The process runs in three conceptual stages. First, the model starts with a canvas of random placeholder tokens. Second, it makes multiple passes over that canvas. It locks in high-confidence tokens and uses them as context. Third, the text converges into the final output.

Google calls the core mechanism Uniform State Diffusion. Highly confident tokens help resolve adjacent positions during denoising. The full sequence then snaps into focus over several passes.
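The three stages can be illustrated with a toy loop. This is only a conceptual sketch, not Google's sampler: the "model" here already knows the target text, confidences are random numbers, and `toy_denoise` and `tokens_per_pass` are hypothetical names chosen for illustration.

```python
import random

def toy_denoise(target, tokens_per_pass=3, seed=0):
    """Toy stand-in for confidence-based parallel denoising.

    A real model would predict tokens and confidences; here the
    "prediction" is the known target and confidences are random.
    """
    rng = random.Random(seed)
    canvas = ["_"] * len(target)           # stage 1: placeholder canvas
    history = [list(canvas)]
    while "_" in canvas:                    # stage 2: repeated passes
        open_positions = [i for i, tok in enumerate(canvas) if tok == "_"]
        confidence = {i: rng.random() for i in open_positions}
        # Lock in the most confident open positions on this pass.
        best = sorted(open_positions, key=confidence.get, reverse=True)
        for i in best[:tokens_per_pass]:
            canvas[i] = target[i]
        history.append(list(canvas))
    return canvas, history                  # stage 3: converged output

target = "the quick brown fox jumps over the lazy dog".split()
final, history = toy_denoise(target)
for step, state in enumerate(history):
    print(step, " ".join(state))
```

Each printed line is one pass over the canvas. Positions resolve in no fixed order, which is the main visual difference from left-to-right decoding.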

In practice, the model denoises a 256-token canvas in parallel. It finalizes roughly 15-20 tokens per forward pass. That parallelism is what drives the throughput gains.
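The pass-count arithmetic behind those figures is simple enough to check directly. The sketch below uses only the numbers quoted above; `passes_needed` is a hypothetical helper.

```python
import math

CANVAS_TOKENS = 256  # canvas size, from the article

def passes_needed(canvas_tokens: int, tokens_per_pass: int) -> int:
    """Forward passes required to finalize every token on the canvas."""
    return math.ceil(canvas_tokens / tokens_per_pass)

autoregressive_passes = passes_needed(CANVAS_TOKENS, 1)    # one token per pass
diffusion_passes_low = passes_needed(CANVAS_TOKENS, 20)    # ~20 tokens per pass
diffusion_passes_high = passes_needed(CANVAS_TOKENS, 15)   # ~15 tokens per pass

print(autoregressive_passes, diffusion_passes_low, diffusion_passes_high)
```

That works out to roughly 13 to 18 passes per canvas instead of 256. The reduction in passes does not map one-to-one onto wall-clock speed, since each diffusion pass processes the whole canvas and is more expensive than a single-token step, which is consistent with the reported speedup being up to 4x.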

The model uses bidirectional attention during denoising. Every token on the canvas can attend to every other token. This is a sharp break from autoregressive models. Those models can only look backward at prior tokens.

That bidirectional context enables real-time self-correction. If a token’s confidence drops, the sampler can re-noise it. The model then replaces that token on a later pass. Autoregressive models cannot do this, since they commit each token once.
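A minimal sketch of the re-noising idea follows. The article does not describe the sampler's actual rule, so the fixed threshold, the `MASK` symbol, and the confidence values are all assumptions made for illustration.

```python
MASK = "_"  # hypothetical placeholder symbol

def renoise(canvas, confidences, threshold=0.5):
    """Reset tokens whose confidence fell below the threshold."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(canvas, confidences)]

canvas = ["the", "cat", "sat", "in", "the", "mat"]
# Hypothetical confidences after re-scoring with full bidirectional context.
confidences = [0.98, 0.95, 0.97, 0.31, 0.96, 0.93]

canvas = renoise(canvas, confidences)
print(canvas)  # ['the', 'cat', 'sat', '_', 'the', 'mat']
```

The reset position goes back into the pool of open slots, so a later pass can fill it with a better token using the now-settled neighbors as context.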

The Architecture

The technical advance here is hardware utilization. For local GPU inference, the main bottleneck is memory bandwidth. Autoregressive models reload the weights from memory for every generated token, so during single-user serving the GPU spends most of its time waiting on memory rather than computing.

DiffusionGemma shifts the bottleneck from memory bandwidth to compute. It drafts and refines a 256-token canvas in parallel. This gives idle tensor cores a large parallel workload.
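One way to see the shift is arithmetic intensity: how many floating-point operations are done per byte of weights read. The sketch below uses the common approximation of 2 FLOPs per weight per position and assumes 16-bit weights that are each read once per pass; both are simplifying assumptions, and attention or KV-cache traffic and MoE routing effects are ignored.

```python
def arithmetic_intensity(positions_per_pass: int, bytes_per_weight: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Approximates 2 FLOPs per weight per position (multiply and add)
    and assumes every weight used is read from memory once per pass.
    """
    return 2 * positions_per_pass / bytes_per_weight

BYTES_PER_WEIGHT = 2.0  # assumption: 16-bit weights

ar = arithmetic_intensity(1, BYTES_PER_WEIGHT)           # one token per step
diffusion = arithmetic_intensity(256, BYTES_PER_WEIGHT)  # 256-token canvas

print(f"autoregressive: {ar:.0f} FLOP/byte")
print(f"diffusion:      {diffusion:.0f} FLOP/byte")
```

The parameter count cancels out, so the ratio depends only on how many positions share each weight read. In a real MoE, the 256 positions route to different experts, so more weights are touched per pass than for a single token and the practical gain is smaller than this idealized 256x.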

The model alternates two attention modes during inference. Prefill uses causal attention to ingest the prompt and write the KV cache. Denoising uses bidirectional attention to refine the canvas.
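The two modes can be pictured as one attention mask over the prompt and the canvas. This is an illustrative sketch in plain Python, not the model's implementation: in practice prefill and denoising run as separate steps against a KV cache, and `build_mask` is a hypothetical helper.

```python
def build_mask(prompt_len: int, canvas_len: int, denoising: bool):
    """Return mask[q][k] = True when query position q may attend to key k.

    Prompt positions always use causal attention. Canvas positions see
    the whole prompt; among themselves they are bidirectional only
    while denoising.
    """
    n = prompt_len + canvas_len
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k <= q:
                mask[q][k] = True   # causal: look at self and earlier
            elif denoising and q >= prompt_len and k >= prompt_len:
                mask[q][k] = True   # canvas positions see later canvas
    return mask

# 2 prompt tokens followed by a 3-token canvas.
for row in build_mask(2, 3, denoising=True):
    print("".join("x" if allowed else "." for allowed in row))
```

With `denoising=False` the same function yields an ordinary causal mask, which corresponds to the prefill mode described above.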

For longer outputs, DiffusionGemma uses Block Autoregressive Diffusion. Once a 256-token block is fully denoised, it commits to the KV cache. The model then starts a fresh canvas conditioned on prior history. This pairs parallel block speed with sequential autoregressive stability.
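The outer loop of that scheme can be sketched as follows. `denoise_block` is a hypothetical stand-in that returns dummy tokens, the list `committed` plays the role of the KV cache contents, and the shorter final block is an assumption, since the article does not say how a partial last canvas is handled.

```python
BLOCK_SIZE = 256  # canvas size, from the article

def denoise_block(history, length):
    """Hypothetical stand-in for denoising one canvas given prior history."""
    start = len(history)
    return [f"tok{start + i}" for i in range(length)]

def generate(prompt, max_new_tokens, block_size=BLOCK_SIZE):
    committed = list(prompt)        # stands in for the KV cache contents
    blocks = 0
    remaining = max_new_tokens
    while remaining > 0:
        length = min(block_size, remaining)
        block = denoise_block(committed, length)  # parallel within a block
        committed.extend(block)                   # commit, then move on
        remaining -= length
        blocks += 1
    return committed[len(prompt):], blocks

output, blocks = generate(["<prompt>"], max_new_tokens=600)
print(len(output), blocks)  # 600 3
```

Generation is parallel inside each block and sequential across blocks, so a 600-token answer needs three canvases rather than 600 single-token steps.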

The architecture shares the same backbone as Gemma 4 26B A4B. Developers mainly need to implement a denoising step. That makes integration into existing serving frameworks simpler.

A clear example is the Sudoku showcase from Google’s developer guide. Autoregressive models struggle with strict, multivariable constrained puzzles. The base DiffusionGemma model solves roughly 0% of Sudoku puzzles. After a simple JAX supervised fine-tuning recipe, correctness rises to 80%. The fine-tuned model also stops earlier, cutting inference steps.
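The article does not say how correctness was scored. One plausible check, sketched here as an assumption, is to verify that a generated 9x9 grid satisfies every row, column, and box constraint; `is_valid_solution` is a hypothetical helper and does not compare against the puzzle's given clues.

```python
def is_valid_solution(grid):
    """Check digits 1-9 appear exactly once per row, column, and 3x3 box."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in range(0, 9, 3)
        for c in range(0, 9, 3)
    ]
    return all(group == digits for group in rows + cols + boxes)

# A valid grid built from a standard shifting pattern.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
          for r in range(9)]

# Swapping two cells in one row keeps the row valid but breaks two columns.
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]

print(is_valid_solution(solved), is_valid_solution(broken))
```

Every cell is constrained by cells on all sides, which is why a model that can revise any position with full bidirectional context is a natural fit for this kind of task.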

How DiffusionGemma Decodes in Parallel

The two decoding styles differ mainly in how many tokens each forward pass resolves. In autoregressive decoding, tokens fill in one at a time, strictly left to right, taking one forward pass per token, the way most LLMs generate today. In diffusion decoding, the model starts from a canvas of masked placeholder tokens and resolves many of them in parallel on each pass, in no fixed order, converging in far fewer passes. A low-confidence token can also be reset (re-noised) and refined again, which autoregressive decoding cannot do once a token is committed. The real DiffusionGemma resolves a 256-token canvas and finalizes roughly 15-20 tokens per forward pass.

Marktechpost
Practitioner-first AI/ML coverage — deep dives, model releases, and research, decoded for builders.

Use Cases

DiffusionGemma targets specific workloads, not general production quality. Google and ecosystem partners highlight several practical applications:

In-line editing and code infilling: Bidirectional attention suits non-linear text structures well.

Rapid iteration: Low local latency supports interactive, single-user developer loops.

Long-context document analysis: The 256K window supports large input processing.

OCR and document parsing: Multimodal input handles images and scanned documents.

Code generation, tool calling, and agentic workflows: Unsloth lists these as supported tasks.

Constrained generation: Sudoku, mathematical graphs, and amino acid sequences benefit from parallel attention.

One caveat shapes all of these. The speedup is designed for local, low-concurrency inference. In high-QPS cloud serving, autoregressive models saturate compute efficiently. There, parallel decoding offers diminishing returns and can raise serving costs.

https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/

DiffusionGemma vs Standard Gemma 4

| Attribute | DiffusionGemma (26B-A4B) | Standard Gemma 4 (26B A4B) |
|---|---|---|
| Generation method | Discrete text diffusion (parallel) | Autoregressive (token-by-token) |
| Decode bottleneck | Compute-bound | Memory-bandwidth-bound |
| Parallel unit | 256-token canvas per pass | One token per step |
| Attention during decode | Bidirectional | Causal (backward only) |
| Self-correction | Yes, via re-noising | No, tokens are committed once |
| Speed on dedicated GPU | Up to 4x faster | Baseline |
| H100 throughput | 1000+ tokens/sec | Lower (baseline) |
| RTX 5090 throughput | 700+ tokens/sec | Lower (baseline) |
| Output quality | Lower than Gemma 4 | Higher; recommended for production |
| Best fit | Local, low-concurrency, interactive | High-quality and high-QPS cloud serving |
| License | Apache 2.0 | Gemma terms |

Key Takeaways

DiffusionGemma is a 26B MoE open model (3.8B active) that generates text via parallel diffusion, not token-by-token.

It runs up to 4x faster on dedicated GPUs: 1000+ tokens/sec on H100, 700+ on RTX 5090.

Bidirectional attention over a 256-token canvas enables real-time self-correction, unlike autoregressive models.

Quantized, it fits in 18GB VRAM with day-zero support in vLLM, Transformers, MLX, and Unsloth.

It’s experimental and lower-quality than standard Gemma 4; Google recommends Gemma 4 for production.

Performance and Footprint

Throughput numbers and hardware limits from Google:

1000+ tokens/sec on a single NVIDIA H100.

700+ tokens/sec on an NVIDIA GeForce RTX 5090.

Fits within 18GB of VRAM when quantized.

Native NVFP4 (4-bit floating-point) with near-lossless accuracy.

The speedup is designed for local, low-concurrency inference.

Availability and Tooling

DiffusionGemma ships as open weights with day-zero ecosystem support:

Weights are on Hugging Face as google/diffusiongemma-26B-A4B-it.

It is the first diffusion LLM natively supported in vLLM.

Transformers, MLX, and Unsloth are also supported, NeMo is available for fine-tuning, and llama.cpp support is coming soon.

Deployment is available via Google Cloud Model Garden or NVIDIA NIM.


The post Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation appeared first on MarkTechPost.
