P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM
P-EAGLE enables parallel generation of draft tokens for speculative decoding in vLLM. The method produces K draft tokens in a single forward pass, rather than drafting them one token at a time.
P-EAGLE delivers speedups of 1.05x to 1.69x over EAGLE-3 on benchmarks including MT-Bench, HumanEval, and SpeedBench, measured with GPT-OSS 20B on B200 GPUs. vLLM has included the integration since version 0.16.0, added through pull request 32887. Setting the configuration flag parallel_drafting to true activates the feature in the serving pipeline.
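As a rough sketch of how that flag might be passed, the snippet below assembles a speculative-decoding config as JSON and prints a vllm serve command. Only parallel_drafting comes from this story. The --speculative-config option and the method, model, and num_speculative_tokens keys follow vLLM's existing speculative-decoding configuration as we understand it, the value 5 is illustrative, and the drafter repository name is a placeholder, so check the vLLM documentation for your version before relying on it.

```python
import json
import shlex

# Assumption: every key except "parallel_drafting" mirrors vLLM's usual
# speculative-decoding config. The drafter repo name is a placeholder and
# the token count is an illustrative value, not a recommendation.
speculative_config = {
    "method": "eagle3",
    "model": "<p-eagle-drafter-repo>",
    "num_speculative_tokens": 5,
    "parallel_drafting": True,
}

# Build the command as a list, then quote it safely for a shell.
command = [
    "vllm", "serve", "openai/gpt-oss-20b",
    "--speculative-config", json.dumps(speculative_config),
]
print(shlex.join(command))
```

Building the command as a list and quoting it with shlex.join keeps the embedded JSON intact when it is pasted into a shell.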
Pre-trained parallel drafter heads are available on Hugging Face for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B. The drafter processes prompt positions together with their corresponding hidden states, and fills the multi-token prediction slots with learnable mask tokens. A Triton kernel fuses batch expansion and metadata generation to limit overhead.
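The layout described above can be illustrated with a minimal sketch. It assumes that the last prompt position predicts the first draft token and that K-1 mask slots predict the rest; the story does not spell out that split. The function name, the MASK id, and the returned fields are hypothetical, and the hidden states that accompany each position are left out for brevity.

```python
from typing import List, Tuple

MASK = -1  # hypothetical id standing in for the learnable mask token


def build_parallel_draft_inputs(
    prompt_ids: List[int], k: int, mask_id: int = MASK
) -> Tuple[List[int], List[int], List[int]]:
    """Lay out a single drafter pass that yields k draft tokens.

    Assumed layout: the prompt tokens, followed by k - 1 mask slots.
    The last prompt position predicts draft token 1, and each mask
    slot predicts one further draft token.
    """
    if k < 1 or not prompt_ids:
        raise ValueError("need a non-empty prompt and k >= 1")
    n = len(prompt_ids)
    input_ids = list(prompt_ids) + [mask_id] * (k - 1)
    positions = list(range(n + k - 1))
    # Indices whose outputs are read as the k draft tokens.
    draft_slots = list(range(n - 1, n + k - 1))
    return input_ids, positions, draft_slots


ids, pos, slots = build_parallel_draft_inputs([11, 12, 13], k=4)
print(ids, pos, slots)
```

Under this layout all K predictions come out of one pass over the expanded sequence, whereas an autoregressive drafter would need K sequential passes.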
A sequence partition algorithm supports training on long sequences.
Sources
Published by Tech & Business, a media brand covering technology and business.
This story was sourced from AWS and reviewed by the T&B editorial agent team.