DevRadar
🤗 HuggingFace · Significant

Gemma 4 MTP Speculative Decoding Delivers Up to 3x Inference Speedup

merve · Tuesday, May 5, 2026 · Original source

Summary

Gemma 4 now supports speculative decoding through MTP (Multi-Token Prediction) drafters, enabling up to a 3x throughput improvement in tokens per second while maintaining identical output quality. Day-0 support is available across HuggingFace transformers, Apple's MLX framework, and the vLLM inference engine. Released under the Apache 2.0 license.

Integration Strategy

When to Use This?

Speculative decoding with MTP drafters is most effective when:

  • High-throughput scenarios: batch inference, API serving, document-processing pipelines
  • Long-form generation: tasks that produce long outputs (summarization, long-form drafting, code completion)
  • Apple Silicon deployments: the MLX implementation targets the memory-bandwidth constraints of Apple's unified memory architecture
  • Cost-sensitive production: when GPU compute is the bottleneck, a 3x throughput gain translates roughly into a 3x reduction in cost per token

Not ideal for: latency-critical, short-generation applications (real-time autocomplete, streaming UI) where the overhead of drafting and verifying tokens may outweigh the throughput gains.
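
For intuition, the toy loop below shows the draft-and-verify shape that MTP-style speculative decoding follows: a cheap drafter proposes several tokens, the main model verifies them, and only agreeing tokens are kept, which is why output quality is unchanged. This is an illustrative sketch with hypothetical target_next/draft_next callables, not the Gemma 4 implementation:

def speculative_decode(target_next, draft_next, prefix, steps, k=4):
    # Toy draft-and-verify loop. target_next/draft_next map a token
    # sequence to the next token. MTP replaces the separate drafter
    # with prediction heads inside the main model.
    out = list(prefix)
    for _ in range(steps):
        # 1. Drafter proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Target checks each proposal; a real implementation scores
        #    all k positions in a single batched forward pass.
        accepted = 0
        for i, token in enumerate(draft):
            if target_next(out + draft[:i]) == token:
                accepted += 1
            else:
                break
        out += draft[:accepted]
        # 3. At the first disagreement (or after full acceptance), emit
        #    the target's own token, so the result is identical to
        #    plain target-only decoding.
        out.append(target_next(out))
    return out

# Demo with trivial stand-ins: both models always predict token 0,
# so every draft is accepted.
print(speculative_decode(lambda s: 0, lambda s: 0, prefix=[1], steps=2))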

How to Integrate?

HuggingFace Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-9b-it",
    device_map="auto",
    speculative_decoding=True  # Enable MTP drafters
)
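
The announcement quotes up to 3x tokens/sec; a quick way to sanity-check that on your own hardware is to time generation with the flag on and off. The helper below is a generic timing sketch (prompt and token counts are arbitrary; only the standard generate() API is used):

import time

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=256):
    # Time a single generate() call and report new tokens per second.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

Load the model twice, once with speculative_decoding=True and once without, and compare the two numbers on a representative prompt.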

vLLM (Production Serving):

vllm serve google/gemma-4-9b-it \
    --speculative-model google/gemma-4-9b-it \
    --speculative-method mtp
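
vllm serve exposes vLLM's standard OpenAI-compatible HTTP API (port 8000 by default), so the server can be exercised with any OpenAI-style client. A minimal sketch using requests (prompt and token count are arbitrary):

import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "google/gemma-4-9b-it",
        "prompt": "Explain speculative decoding in one sentence.",
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["text"])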

MLX (Apple Silicon):

from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/gemma-4-9b-it-mlx",
    speculative_decoding=True
)

# Standard mlx_lm generation call; decoding runs with MTP drafting enabled.
print(generate(model, tokenizer, prompt="Explain speculative decoding.", max_tokens=64))

Compatibility

Component | Requirement
--------- | -----------
PyTorch | Standard Gemma 4 requirements
CUDA | vLLM requires CUDA 11.8+ for speculative decoding
Apple Silicon | MLX optimized for M1/M2/M3/M4 series
Memory | Speculative decoding increases VRAM by ~15-25% (drafter overhead)
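
For scale (an illustrative estimate, not from the announcement): a 9B-parameter model in bf16 holds roughly 18 GB of weights, so a 15-25% drafter overhead corresponds to roughly 2.7-4.5 GB of additional VRAM before accounting for the KV cache.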

Note: Detailed compatibility matrices for specific Gemma 4 model sizes (7B, 9B, 27B) were not available at time of analysis.

Source: @huggingface
Reference: Gemma 4 Speculative Decoding Announcement
Published: 2025
DevRadar Analysis Date: 2026-05-05