# P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

_Friday, June 26, 2026 at 8:19 PM EDT · AI · Latest · Tier 2 — Notable_

![P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM — Primary](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/13/ml-20619-1120x630.png)

P-EAGLE enables parallel generation of draft tokens for speculative decoding in vLLM. The method produces K draft tokens in a single forward pass by using hidden states from the target model and shared mask embeddings for the additional positions. This replaces the autoregressive drafting in EAGLE, which requires one drafter forward pass per draft token.
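The difference in forward-pass count can be illustrated with a toy sketch. Everything here is hypothetical: `ToyDrafter`, the `MASK` id, and the prediction rule are stand-ins that only count calls, and the layout (last real position predicts the first draft, K-1 mask slots predict the rest) is an assumption about how the slots are arranged, not vLLM's implementation.

```python
from typing import List

MASK = -1  # hypothetical id standing in for the shared learnable mask token


class ToyDrafter:
    """Stand-in for a drafter head; it counts forward passes instead of running a model."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def forward(self, tokens: List[int]) -> List[int]:
        # One call represents one forward pass and returns one prediction per position.
        # The rule (100 + index) is arbitrary; a real drafter conditions on target hidden states.
        self.forward_passes += 1
        return [100 + i for i in range(len(tokens))]


def draft_autoregressive(drafter: ToyDrafter, context: List[int], k: int) -> List[int]:
    """EAGLE-style drafting: each draft token needs its own forward pass."""
    tokens = list(context)
    drafts = []
    for _ in range(k):
        next_token = drafter.forward(tokens)[-1]
        drafts.append(next_token)
        tokens.append(next_token)
    return drafts


def draft_parallel(drafter: ToyDrafter, context: List[int], k: int) -> List[int]:
    """Parallel drafting: append k-1 mask slots and read k predictions from one pass."""
    tokens = list(context) + [MASK] * (k - 1)
    predictions = drafter.forward(tokens)
    return predictions[-k:]


if __name__ == "__main__":
    ar, par = ToyDrafter(), ToyDrafter()
    draft_autoregressive(ar, [1, 2, 3], k=4)
    draft_parallel(par, [1, 2, 3], k=4)
    print("autoregressive passes:", ar.forward_passes)
    print("parallel passes:", par.forward_passes)
```

The sketch only counts passes; it says nothing about draft quality or acceptance rate, which determine the end-to-end speedup.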

P-EAGLE delivers speedups ranging from 1.05x to 1.69x over EAGLE-3 on benchmarks including MT-Bench, HumanEval, and SpeedBench, measured with GPT-OSS 20B on B200 GPUs. Integration into vLLM started with version 0.16.0 through pull request 32887. Setting the `parallel_drafting` configuration flag to true activates the feature in the serving pipeline.
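As a rough sketch of how the flag might be passed, the snippet below builds a JSON value for vLLM's `--speculative-config` option and prints the resulting command. Only `parallel_drafting` comes from the report; the `method` value, the `num_speculative_tokens` value, and the drafter repository placeholder are assumptions to be checked against the vLLM documentation and the published drafter's model card.

```python
import json
import shlex

# Speculative decoding settings. Only "parallel_drafting" is taken from the report;
# the other values are illustrative assumptions.
spec_config = {
    "method": "eagle3",                  # assumption: drafter loaded via the EAGLE-3 path
    "model": "<p-eagle-drafter-repo>",   # placeholder: substitute the published drafter head
    "num_speculative_tokens": 4,         # assumption: K, the number of draft tokens per step
    "parallel_drafting": True,           # flag named in the report
}

cmd = [
    "vllm", "serve", "openai/gpt-oss-20b",
    "--speculative-config", json.dumps(spec_config),
]

# Print a shell-safe command line rather than launching the server.
print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the example runnable on a machine without vLLM or a GPU.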

Pre-trained parallel drafter heads are available on Hugging Face for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B. The drafter processes prompt positions with their corresponding hidden states and uses learnable mask tokens for the multi-token prediction slots. A Triton kernel fuses the batch expansion and metadata generation to limit overhead.
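To make "batch expansion and metadata generation" concrete, here is a plain-Python illustration of the kind of bookkeeping involved. It is not the Triton kernel: the function name, the returned fields, and the choice of K-1 mask slots per request are assumptions made for this sketch.

```python
from typing import List, Tuple


def expand_batch(seq_lens: List[int], k: int) -> Tuple[List[int], List[bool], List[int]]:
    """Append k-1 mask slots to every request and build flattened metadata.

    Returns:
        positions:   position index of every token in the flattened batch
        is_mask:     True where the slot is a mask (multi-token prediction) slot
        query_start: cumulative start offset of each request, with a final end offset
    """
    positions: List[int] = []
    is_mask: List[bool] = []
    query_start: List[int] = [0]
    for n in seq_lens:
        total = n + (k - 1)
        positions.extend(range(total))                  # mask slots continue the position count
        is_mask.extend([False] * n + [True] * (k - 1))  # real tokens first, then mask slots
        query_start.append(query_start[-1] + total)
    return positions, is_mask, query_start


if __name__ == "__main__":
    positions, is_mask, query_start = expand_batch([2, 3], k=3)
    print(positions)
    print(is_mask)
    print(query_start)
```

Doing this per request in a Python loop would add latency on every decoding step, which is presumably why the integration fuses the work into a single GPU kernel.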

A sequence partition algorithm supports training on long sequences by splitting positions across partitions while preserving attention dependencies. The integration also extends CUDA graph capture ranges to accommodate the parallel slots.

## Sources

- [AWS](https://aws.amazon.com/blogs/machine-learning/p-eagle-faster-llm-inference-with-parallel-speculative-decoding-in-vllm/)

---
Canonical: https://techandbusiness.org/newswire/WMYow9Ig064KslncDOOprY
Retrieved: 2026-06-27T04:58:02.989Z
Publisher: Tech & Business (techandbusiness.org)
