Google Just Released an AI Model That Generates Text 4x Faster. Here's How.

Nearly every AI language model you've used generates text the same way: one token at a time, left to right, finishing each token before producing the next. It's how GPT-4, Claude, Gemini, and most other major models work. Google DeepMind just shipped something that does it differently.

DiffusionGemma is an open-source experimental model that generates 256 tokens in parallel per pass — not one at a time. The result: text generation that's roughly 4x faster than comparable autoregressive models.

What "Diffusion" Actually Means Here

You've probably heard of diffusion models in the context of image generation — Stable Diffusion, Midjourney, DALL-E. They work by starting with noise and iteratively refining it into a coherent image. DiffusionGemma applies the same core idea to text.

Instead of predicting the next word given all previous words, a diffusion language model starts with a rough draft of the whole block and refines it over several passes. On each pass the model attends to every token at once (bidirectional attention), so it can correct earlier mistakes as it goes.
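
To make the "refine a whole block in parallel" idea concrete, here is a toy sketch in Python. It is not DiffusionGemma's actual decoding algorithm: the `MASK` placeholder, the `refine` and `generate` helpers, and the "longer words are filled first" confidence rule are all assumptions made up for illustration, and the toy "model" simply knows the target sentence in advance.

```python
MASK = "<mask>"

def refine(draft, target, per_step):
    """One toy refinement pass: fill up to `per_step` masked slots at once."""
    masked = [i for i, tok in enumerate(draft) if tok == MASK]
    # Stand-in for model confidence (an assumption): longer words count as "easier".
    masked.sort(key=lambda i: (-len(target[i]), i))
    new_draft = list(draft)
    for i in masked[:per_step]:
        new_draft[i] = target[i]
    return new_draft

def generate(target, per_step):
    """Start from a fully masked draft and refine until no mask remains."""
    if per_step < 1:
        raise ValueError("per_step must be at least 1")
    draft = [MASK] * len(target)
    passes = 0
    while MASK in draft:
        draft = refine(draft, target, per_step)
        passes += 1
    return draft, passes

target = "the quick brown fox jumps over the lazy dog".split()
out, passes = generate(target, per_step=3)
# 9 tokens at 3 per pass takes 3 passes; one token per pass would take 9.
print(passes, " ".join(out))
```

Note that the slots are not filled left to right. The first pass fills "quick", "brown", and "jumps" because the toy scoring rule ranks them highest, which is the kind of ordering freedom a one-token-at-a-time decoder does not have.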

This is a fundamentally different architecture. Autoregressive models can only look backwards (they haven't generated the future tokens yet). Diffusion models can look in both directions, which means they have more context when fixing errors.
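
The difference in what each position can "see" is easy to show with attention masks. The sketch below builds the two standard mask shapes in plain Python; the function names are illustrative and not taken from any DiffusionGemma code.

```python
def causal_mask(n):
    """Position i may attend only to positions 0..i (autoregressive)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible(mask, i):
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]

n = 4
print(visible(causal_mask(n), 1))         # [0, 1]
print(visible(bidirectional_mask(n), 1))  # [0, 1, 2, 3]
```

With the causal mask, the second token sees only itself and the token before it. With the bidirectional mask it sees the whole block, including tokens to its right.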

The Numbers

DiffusionGemma is a 26B-parameter Mixture-of-Experts model, but only 3.8B parameters are active during any given inference pass. That's a key efficiency trick: each pass pays roughly the compute cost of a 3.8B model while drawing on the capacity of all 26B parameters.
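
As a rough back-of-envelope check on those two figures, the snippet below computes the active fraction and a per-token compute estimate. The "2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass, used here as an assumption; it ignores attention and routing overhead.

```python
TOTAL_PARAMS = 26e9    # from the article
ACTIVE_PARAMS = 3.8e9  # from the article

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rule-of-thumb assumption: about 2 FLOPs per active parameter per token.
flops_sparse = 2 * ACTIVE_PARAMS
flops_dense = 2 * TOTAL_PARAMS
compute_ratio = flops_dense / flops_sparse

print(f"active fraction: {active_fraction:.1%}")
print(f"dense-equivalent compute would be about {compute_ratio:.1f}x higher")
```

Under that assumption, about 15% of the weights do the work on each pass, and a dense 26B model would need close to 7x the compute per token.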

After quantization, it fits in 18GB of VRAM — within reach of a consumer GPU like an RTX 4090 or 5090. Benchmarked performance:

  • H100: 1,000+ tokens per second
  • RTX 5090: 700+ tokens per second

For context, most production-grade autoregressive models running locally generate far fewer tokens per second than this, so the 4x speedup claim is consistent with these figures.
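
The 18GB figure can be sanity-checked with simple arithmetic. The sketch below estimates weight storage alone at a few bit widths, assuming 1 GB = 10^9 bytes and ignoring activations and other runtime buffers. The article does not say which quantization scheme is used, so the bit widths are illustrative.

```python
TOTAL_PARAMS = 26e9  # from the article

def weights_gb(params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weights_gb(TOTAL_PARAMS, bits):.0f} GB")

# Average bits per parameter if the whole 18 GB were weights (an upper bound).
implied_bits = 18e9 * 8 / TOTAL_PARAMS
print(f"implied average: {implied_bits:.1f} bits per parameter")
```

At 16 bits the weights alone would need about 52 GB, far beyond the 24 GB on an RTX 4090. Somewhere between 4 and 6 bits per parameter is what brings the total down to the quoted range.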

What It's Designed For

DiffusionGemma isn't positioned as a general-purpose chat model. Google DeepMind calls out specific use cases:

  • Inline editing — rewriting a section of text in place, where the model can see what comes before and after
  • Code fill-in-the-middle — completing a function body when the signature and surrounding code are already written
  • Local interactive workflows — anything where low latency matters and you want to run on your own hardware

The bidirectional attention and self-correction capabilities make it especially well-suited for these tasks. An autoregressive model has to commit to each token before seeing what comes next; DiffusionGemma can revise.

It's Open Source

Released under the Apache 2.0 license, DiffusionGemma is fully open. You can use it commercially, modify it, and run it locally. That's a meaningful commitment from Google DeepMind — Apache 2.0 is one of the most permissive open-source licenses available.

This puts it alongside Gemma's existing open model family, which has become a popular base for local AI development.

What This Means If You Use OpenClaw

Speed matters a lot when you're running AI agents. An agent that completes a task in 10 seconds feels responsive; one that takes 2 minutes feels like a bottleneck.

Right now, most agent workflows are bottlenecked by inference speed. Every tool call, reasoning step, and output means waiting for the model to generate tokens one at a time. A 4x speedup in the underlying model doesn't just shorten each step; across a long chain of steps, it can make entire classes of workflows practical that weren't before.
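
Here is what that looks like with numbers. The workload below (20 steps of 500 generated tokens each) is hypothetical, and the 175 tokens/s baseline is just the 700 tokens/s figure divided by the claimed 4x. The estimate covers generation time only and ignores prompt processing, tool execution, and network latency.

```python
def chain_seconds(steps, tokens_per_step, tokens_per_second):
    """Generation time for a sequential chain of agent steps."""
    return steps * tokens_per_step / tokens_per_second

STEPS = 20             # hypothetical agent run
TOKENS_PER_STEP = 500  # hypothetical output size per step

fast_tps = 700              # RTX 5090 figure from the article
baseline_tps = fast_tps / 4  # assumption implied by the 4x claim

baseline = chain_seconds(STEPS, TOKENS_PER_STEP, baseline_tps)
fast = chain_seconds(STEPS, TOKENS_PER_STEP, fast_tps)

print(f"baseline: {baseline:.1f} s, with 4x speedup: {fast:.1f} s")
```

Under these assumptions the same run drops from just under a minute to about 14 seconds, which is the difference between a background job and something you can wait on interactively.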

Local models are increasingly viable for agent use cases, especially for tasks where you want low latency, privacy, or offline capability. DiffusionGemma — fitting in 18GB VRAM, running at 700+ tokens/s on consumer hardware — is exactly the kind of development that makes running your own agent stack more attractive.

OpenClaw is built to work with the models you want to use. As the local model ecosystem improves, the case for running your own persistent AI agent gets stronger.