AWS brings P-EAGLE parallelized speculative decoding to SageMaker JumpStart, reporting up to 1.69x throughput over vanilla EAGLE
Amazon Web Services has made P-EAGLE, a parallelized speculative decoding technique, available through Amazon SageMaker JumpStart. The method accelerates large language model inference by generating multiple draft tokens in a single forward pass, eliminating the sequential bottleneck that limits conventional speculative decoding approaches.
What's new
P-EAGLE (Parallel EAGLE) targets the drafting stage of the speculative decoding pipeline, where EAGLE generates candidate tokens one at a time. According to AWS, the technique "completely eliminates the nested sequential drafting phase by predicting all speculative draft tokens simultaneously in a single forward pass."
Performance benchmarks published by AWS show:
- Up to 1.69x throughput improvement over vanilla EAGLE frameworks across MT-Bench, HumanEval, and SPEED-Bench Code
- 1.22x faster than EAGLE-3 on HumanEval at concurrency 1 (c=1)
- 1.41x faster than EAGLE-3 on SPEED-Bench Code at concurrency 1 (c=1)
- Gains sustained at high concurrency levels (c=64)
Four models are available through SageMaker JumpStart with pre-trained P-EAGLE draft heads at launch:
- GPT-OSS-120B
- GPT-OSS-20B
- Qwen3-Coder-30B-A3B-Instruct
- Gemma-4-31B-IT
Deployment is available through JumpStart's one-click path, requiring at minimum an ml.g7e.2xlarge instance. The inference request format is unchanged.
Context
Speculative decoding has become a primary technique for improving transformer inference throughput without retraining the target model. The standard approach uses a small draft model to predict candidate tokens, which the larger target model then verifies in a single batched pass, accepting a run of tokens in the time it would otherwise take to generate one.
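The draft-then-verify loop described above can be illustrated with a minimal Python sketch. It assumes greedy decoding, where a draft token is accepted only if it matches the token the target model would have produced; production systems typically use a probabilistic acceptance rule and verify all positions in one batched forward pass rather than a loop. The names `speculative_step`, `draft_next`, and `target_next` are hypothetical and stand in for real models.

```python
from typing import Callable, List

# A "model" here is just a function mapping a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, k: int
) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # Draft phase: the small model proposes k tokens, one after another.
    draft: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # Verify phase: a real target model scores all k positions in one
    # batched pass; this loop only emulates that check position by position.
    accepted: List[int] = []
    ctx = list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if expected != token:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def toy_target(ctx: List[int]) -> int:
    # Toy target model: always counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Toy draft model: agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, k=3))  # [2, 3, 4, 5]
print(speculative_step([3], toy_draft, toy_target, k=3))  # [4, 5]
```

In the first call all three draft tokens are accepted and the target adds a fourth; in the second, the draft diverges at the second position, so only the matching token plus the target's correction are kept. Either way the output is identical to what the target model alone would have generated.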
EAGLE is an established speculative decoding framework that improved acceptance rates by training the draft head on feature-level representations rather than raw token embeddings. P-EAGLE extends this further by breaking the sequential dependency within EAGLE's own drafting loop, generating all K candidate tokens simultaneously rather than one at a time.
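The difference between sequential and parallel drafting can be sketched by counting draft-head forward passes. The toy below is an illustration under stated assumptions, not AWS's implementation: `CountingDraftHead`, `forward_one`, and `forward_parallel` are hypothetical names, and the toy head is constructed so that both modes return the same tokens, which a real parallel draft head is not guaranteed to do.

```python
from typing import List


class CountingDraftHead:
    """Toy stand-in for a draft head that records how many passes it runs."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def forward_one(self, context: List[int]) -> int:
        # Sequential style: one forward pass yields one draft token.
        self.forward_passes += 1
        return (context[-1] + 1) % 100

    def forward_parallel(self, context: List[int], k: int) -> List[int]:
        # Parallel style: one forward pass yields all k draft tokens.
        self.forward_passes += 1
        last = context[-1]
        return [(last + i) % 100 for i in range(1, k + 1)]


def draft_sequential(head: CountingDraftHead, context: List[int], k: int) -> List[int]:
    # Each new token depends on the previous one, so k passes are needed.
    ctx = list(context)
    tokens: List[int] = []
    for _ in range(k):
        token = head.forward_one(ctx)
        tokens.append(token)
        ctx.append(token)
    return tokens


def draft_parallel(head: CountingDraftHead, context: List[int], k: int) -> List[int]:
    # All k positions are predicted together from the same context.
    return head.forward_parallel(context, k)


seq_head, par_head = CountingDraftHead(), CountingDraftHead()
seq_tokens = draft_sequential(seq_head, [7], k=5)
par_tokens = draft_parallel(par_head, [7], k=5)
print(seq_tokens, seq_head.forward_passes)  # [8, 9, 10, 11, 12] 5
print(par_tokens, par_head.forward_passes)  # [8, 9, 10, 11, 12] 1
```

The sequential drafter needs K passes for K tokens, while the parallel drafter needs one regardless of K. That reduction in drafting latency is the mechanism behind the reported speedups.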
Amazon SageMaker JumpStart is AWS's managed deployment path for open-weight models; the P-EAGLE integration extends the technique to customers running hosted inference without custom inference servers.
Why it matters
Throughput gains in the 1.2x–1.7x range can translate into lower cost per generated token for GPU-bound workloads such as batch code generation, continuous document processing, and any inference pipeline running near capacity. AWS reports these gains relative to EAGLE baselines, not relative to decoding without speculation. For teams already running Qwen3-Coder or Gemma-4 models on SageMaker, P-EAGLE is available without weight changes or API modifications.
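As a rough worked example of how a throughput multiplier maps to cost, the sketch below assumes a fixed instance-hour price and a fully utilized endpoint, so cost per token scales as the inverse of throughput. The function name `cost_reduction` is hypothetical, and the percentages are relative to the baseline each speedup was measured against, not to absolute spend.

```python
def cost_reduction(speedup: float) -> float:
    """Fractional drop in cost per token for a given throughput speedup.

    Assumes the hourly instance price is fixed and the endpoint stays
    saturated, so cost per token is proportional to 1 / throughput.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedups reported by AWS in the benchmarks summarized above.
for speedup in (1.22, 1.41, 1.69):
    print(f"{speedup:.2f}x -> {cost_reduction(speedup):.1%} lower cost per token")
# 1.22x -> 18.0% lower cost per token
# 1.41x -> 29.1% lower cost per token
# 1.69x -> 40.8% lower cost per token
```

Under these assumptions, the headline 1.69x figure corresponds to roughly a 41% lower cost per token, while the 1.22x result on HumanEval corresponds to about 18%.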
The inclusion of Qwen3-Coder-30B-A3B-Instruct — the only coding-specialist model in the launch set — points to AWS targeting developer tooling workloads alongside general inference. As inference cost remains a competitive differentiator among cloud providers, parallelized speculative decoding is likely to expand to additional SageMaker JumpStart models in subsequent updates.
Corroborating sources
- AWS Machine Learning Blog (aws.amazon.com)
https://aws.amazon.com/blogs/machine-learning/parallelize-speculative-decoding-with-p-eagle-on-amazon-sagemaker-ai/
“P-EAGLE completely eliminates the nested sequential drafting phase by predicting all speculative draft tokens simultaneously in a single forward pass.”