
Speculative Decoding

Techniques

An inference optimization where a small, fast draft model generates candidate tokens that a larger model then verifies in parallel, speeding up generation without changing output quality.

Speculative decoding is a technique for accelerating autoregressive text generation by using two models: a small, fast draft model and a larger, more capable target model. The draft model quickly generates several candidate tokens. The target model then evaluates all of them in a single forward pass, accepting tokens that match what it would have generated itself. At the first mismatch, the target model substitutes its own token, the remaining candidates are discarded, and generation continues from that position.
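The loop can be sketched with toy stand-ins for the two models. Here `draft_next` and `target_next` are hypothetical functions that return a single greedy next token; they are not a real model API, and a production system would score all k draft tokens in one batched forward pass rather than one at a time:

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=12):
    """Greedy speculative decoding sketch (toy interface).

    draft_next / target_next: hypothetical functions mapping a token
    sequence to its next greedy token. In a real system both are
    transformer forward passes, and the target verifies all k draft
    tokens in a single parallel pass.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The draft model proposes k candidate tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            draft.append(ctx[-1])
        # 2. The target model verifies the candidates left to right.
        for i, t in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if t != expected:
                # 3. First mismatch: keep the accepted prefix, take the
                #    target's own token, and discard the rest.
                tokens += draft[:i] + [expected]
                break
        else:
            tokens += draft  # every candidate was accepted
    return tokens[len(prompt):][:max_new]
```

Because every kept token is one the target model would have produced itself, the output matches target-only greedy decoding token for token.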

The speedup comes from a fundamental asymmetry in transformer inference. Generating tokens one at a time is slow because each token requires a full forward pass. But verifying multiple tokens at once is fast because the target model can process them in parallel. If the draft model is good enough that most of its tokens get accepted, the target model does fewer total forward passes while producing identical output to what it would have generated on its own.
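The arithmetic behind the speedup can be made concrete. Under the simplifying assumption that each draft token is accepted independently with probability alpha, a draft-and-verify round of length k yields (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation: the accepted prefix plus one token from the target itself. The draft-to-target cost ratio below is an illustrative assumption, not a measured number:

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    # Expected tokens produced per target forward pass, assuming each of
    # the k draft tokens is accepted independently with probability
    # alpha (0 <= alpha < 1); the round always yields one extra token
    # from the target itself, either a correction or a bonus token.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost: float = 0.1) -> float:
    # Tokens per unit of target-model compute, assuming one draft
    # forward pass costs draft_cost target passes (illustrative ratio).
    return expected_tokens_per_round(alpha, k) / (1 + k * draft_cost)
```

With alpha = 0.8, k = 4, and a draft model a tenth the cost of the target, this gives roughly 3.36 tokens per round and an estimated speedup of about 2.4x.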

The technique requires no retraining of either model. The draft model can be a smaller version of the same model family, a quantized version, or even a completely different architecture. The only requirement is that it generates tokens from the same vocabulary. Acceptance rates typically range from 70% to 90% for well-matched draft models, yielding 2x to 3x speedups in practice. Speculative decoding has become standard in production LLM serving infrastructure, particularly for latency-sensitive applications where users are waiting for streaming responses.

Last updated: March 5, 2026