vLLM Community Collaboration Newsletter

Created in February 2026

2026 · vllm speculative-decoding open-source · research

A timeline of community contributions and milestones from the vLLM ecosystem in 2026 — big-feature releases alongside my own engagements as @pymhq, organized by quarter (newest first).

Disclaimer: This newsletter collects personal contribution notes from an individual builder's perspective. All topics, observations, and opinions expressed are solely my own.

Q2 2026 (Apr–Jun)

pymhq Contributions

May 14 — vllm#42687 (issue): Gemma-4 fails to start on GPUs with <70GB memory because the default max_num_batched_tokens falls below the multimodal token-size threshold. Capacity-planning footgun for sub-H100 multimodal deployments.
May 13 — vllm#42536 (PR review): Remove verifier model type check in speculative config. Stale guard removed now that the speculator stack handles the check itself. Merged.
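The startup failure in vllm#42687 comes down to a budget comparison: the per-step token budget has to be at least as large as the biggest single multimodal item. Below is a minimal sketch of that kind of preflight check. The helper name `check_token_budget` and the numbers in the usage lines are hypothetical placeholders for illustration; they are not vLLM's actual defaults, Gemma-4's real token count, or vLLM's internal check.

```python
def check_token_budget(max_num_batched_tokens: int, max_mm_tokens_per_item: int) -> None:
    """Raise if one multimodal item cannot fit into a single scheduler step.

    Hypothetical helper for illustration only; not part of vLLM's API.
    """
    if max_num_batched_tokens < max_mm_tokens_per_item:
        raise ValueError(
            f"max_num_batched_tokens={max_num_batched_tokens} is smaller than "
            f"the largest multimodal item ({max_mm_tokens_per_item} tokens); "
            "raise the batch budget or lower the per-item limit."
        )


# Placeholder numbers, chosen only to show both outcomes.
check_token_budget(max_num_batched_tokens=8192, max_mm_tokens_per_item=2496)  # passes

try:
    check_token_budget(max_num_batched_tokens=2048, max_mm_tokens_per_item=2496)
except ValueError as err:
    print(f"startup would fail: {err}")
```

For capacity planning, a likely workaround is to set max_num_batched_tokens explicitly on smaller GPUs rather than rely on the default.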

May 26 — EAGLE 3.1: Stabilizing the Drafter Against Attention Drift

The EAGLE, vLLM, and TorchSpec teams jointly published EAGLE 3.1, an enhanced speculative-decoding algorithm that tackles attention drift — the instability that degrades EAGLE-3 drafting under different chat templates, long-context inputs, and out-of-distribution system prompts.

What changed in EAGLE 3.1

FC normalization applied after each target hidden state.
Post-norm hidden-state feedback fed into subsequent decoding steps.
Together these stabilize the drafter across deeper speculation levels and varied deployment scenarios.

Up to 2× acceptance length vs. EAGLE-3 on long-context inputs
2.03× output tokens/sec on Kimi K2.6 at C=1

On the SPEED-Bench coding dataset with Kimi K2.6, per-user output throughput improves 2.03× at concurrency 1, 1.71× at C=4, and 1.66× at C=16.
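For intuition on what "acceptance length" measures, here is a minimal sketch using the textbook speculative-decoding model. It assumes each draft token is accepted independently with the same probability `alpha`, and that the verifier always contributes one extra token per step. Both are simplifying assumptions of mine, not the methodology of the EAGLE 3.1 post, and benchmarks may count the extra token differently.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes k draft tokens, each accepted independently with probability
    alpha, with drafting stopping at the first rejection. The verifier
    always adds one token (a correction or a bonus token), so the result
    is the geometric sum 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# A drafter whose per-token acceptance drops (for example under drift)
# loses most of the benefit of deeper speculation.
for alpha in (0.5, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, k=4), 3))
```

Under this model, a more stable drafter raises tokens per step most at deeper speculation levels, which is where the post reports the gains.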

Why it matters here: attention drift was a known failure mode for the EAGLE-style draft heads this newsletter tracks — the same family as P-EAGLE. EAGLE 3.1 lands in vLLM v0.22.0.

Resources:

vLLM blog: EAGLE 3.1
TorchSpec training support: lightseekorg/TorchSpec
Draft model: lightseekorg/kimi-k2.6-eagle3.1-mla

Q1 2026 (Jan–Mar)

The launch quarter — vLLM 0.16.0 shipped P-EAGLE on Feb 26, the Model Acceleration SIG kicked off mid-month, and efficient multi-LoRA serving for MoE models landed alongside the release.

pymhq Contributions

Mar 30 — speculators#335 (RFC comment): Use vLLM's extract_hidden_states system (and enable online training). Wires the hidden-states extraction path from vllm#33118 into the speculators training loop — prerequisite for online training of EAGLE-style draft heads.
Mar 12 — speculators#292 (RFC comment): Add P-EAGLE support in training. Carries the parallel-drafting collapse back into the training recipe.
Mar 10 — vllm#36718 (issue): vLLM 0.15.0 startup on H200 fails inside deep_gemm. Production-deployment regression.
Feb 28 — vllm#26679 (comment): Ensure output consistency when using LoRA with Eagle3 Speculative Decoding. Follow-up to the Feb 23 SIG discussion.
Feb 16 — vllm#34643 (issue): vLLM leaves orphan processes when parent dies on OOM. Surfaced during Qwen3 30B Coder testing; raised at the Feb 16 SIG.

Mar 30 — Extracting Hidden States from vLLM

The vLLM team published "Extracting hidden states from vLLM" (by Fynn Schmitt-Ulms), introducing a native hidden-states extraction system landing in vLLM 0.18.0+. It closes a long-standing gap: to capture intermediate-layer representations, downstream training libraries had to either swap in a separate transformer implementation (losing vLLM's performance) or patch vLLM internals (creating a maintenance burden).

How it works

Reuses existing Eagle-3 model pathways and the KV Connector API — no bespoke hooks per model.
Stores hidden states in dummy draft-model attention layers via vLLM's paged memory system.
Flexible output sinks: disk or device-to-device transfer.
No overhead on standard inference workloads when the feature is off.

As a concrete sizing example, Qwen3-8B with 8k tokens across 4 layers produces 268 MB of FP16 hidden-state data — manageable for offline capture, and the motivation behind ongoing work on async writes and device-to-device connectors for multi-node training scenarios.
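The 268 MB figure can be reproduced with simple arithmetic. The sketch below assumes a hidden size of 4096 for Qwen3-8B and reads "8k tokens" as 8192; neither value is stated in the text above, so treat them as my assumptions.

```python
def hidden_state_bytes(
    num_tokens: int,
    num_layers: int,
    hidden_size: int,
    bytes_per_element: int = 2,  # FP16 uses 2 bytes per value
) -> int:
    """Bytes needed to store one hidden vector per token per captured layer."""
    return num_tokens * num_layers * hidden_size * bytes_per_element


# Assumed inputs: 8192 tokens, 4 captured layers, hidden size 4096.
size = hidden_state_bytes(num_tokens=8192, num_layers=4, hidden_size=4096)
print(f"{size} bytes = {size / 1e6:.0f} MB = {size / 2**20:.0f} MiB")
# prints: 268435456 bytes = 268 MB = 256 MiB
```

The size grows linearly in every factor, so capturing more layers or longer sequences scales the storage and transfer cost proportionally.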

Why it matters for speculative decoding: hidden states are the training signal for EAGLE-style draft heads (including P-EAGLE). A performant, native extraction path is the prerequisite for online training of these draft models — which is exactly what speculators#335 is wiring up on the training side.

Read more:

vLLM blog: Extracting hidden states from vLLM
RFC: vllm#33118 — hidden states extraction

Feb 26 — Speculative Decoding: P-EAGLE Goes Live in vLLM 0.16.0

vLLM 0.16.0 shipped with P-EAGLE support — a meaningful milestone for speculative decoding in production inference. The release bundles several related improvements:

vLLM 0.16.0 highlights for speculative decoding

Unified Parallel Drafting for speculative decoding #32887
Spec decode now works with structured outputs #33374
Penalty application in Model Runner V2 #33251

What is P-EAGLE?

P-EAGLE (Parallel-Drafting EAGLE) rethinks the drafting step in EAGLE-style speculative decoding. Where EAGLE generates K draft tokens through K sequential forward passes, P-EAGLE collapses them into a single forward pass, which significantly cuts draft-phase overhead. Once P-EAGLE checkpoints are available, the throughput lift comes at little extra cost.

1.36× output tokens/sec on GPT-OSS 120B
1.17× output tokens/sec on Qwen3-Coder 30B

Both models match the acceptance length of autoregressive EAGLE-3 in vLLM benchmarks, with no quality regression.
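The difference between sequential and parallel drafting can be illustrated with a toy that only counts forward passes. This is a sketch, not P-EAGLE's implementation: `CountingDrafter` and its "prediction" rule are hypothetical stand-ins, and a real parallel drafter does not condition later positions on earlier draft tokens the way a sequential one does.

```python
from typing import List


class CountingDrafter:
    """Toy stand-in for a draft head that counts its forward passes."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def forward(self, context: List[int], num_outputs: int) -> List[int]:
        """One forward pass predicting `num_outputs` future tokens.

        The prediction is a deterministic toy rule, not a real model.
        """
        self.forward_passes += 1
        last = context[-1]
        return [(last + i + 1) % 100 for i in range(num_outputs)]


def draft_sequential(drafter: CountingDrafter, context: List[int], k: int) -> List[int]:
    """Autoregressive drafting: k passes, each feeding the last draft back in."""
    ctx = list(context)
    drafts: List[int] = []
    for _ in range(k):
        token = drafter.forward(ctx, 1)[0]
        drafts.append(token)
        ctx.append(token)
    return drafts


def draft_parallel(drafter: CountingDrafter, context: List[int], k: int) -> List[int]:
    """Parallel drafting: a single pass predicts all k positions at once."""
    return drafter.forward(list(context), k)


seq, par = CountingDrafter(), CountingDrafter()
print(draft_sequential(seq, [5], k=4), seq.forward_passes)  # 4 passes
print(draft_parallel(par, [5], k=4), par.forward_passes)    # 1 pass
```

The toy rule makes both paths produce identical drafts. In a real system, matching the sequential drafter's acceptance length is something the parallel drafter has to learn in training, which is why the acceptance-length parity reported above matters.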

Resources:

Preprint: arxiv.org/abs/2602.01469
HuggingFace checkpoints: amazon/gpt-oss-120b-p-eagle · amazon/Qwen3-Coder-30B-A3B-Instruct-P-EAGLE

Feb 26 — Speeding Up Multi-LoRA Serving for MoE Models

Also on Feb 26, the vLLM team published work on efficient multi-LoRA serving for Mixture-of-Experts models — a problem that grows quickly in complexity as the number of fine-tuned variants scales. The approach targets the specific characteristics of MoE routing to reduce overhead when switching between adapters.

Feb 23 — vLLM SIG Model Acceleration: Second Meeting

Feb 23, 2026 — Second Meeting

Explored the interaction between LoRA adapters and EAGLE draft heads — specifically how LoRA fine-tuning affects acceptance length and what that means for speculative decoding pipelines.
Discussed feasibility of integrating these patterns upstream in vLLM.

Feb 16 — vLLM SIG Model Acceleration: Inaugural Meeting (+ vllm#34643)

February marked the launch of the Model Acceleration SIG within the vLLM community — a dedicated working group covering quantization and speculative decoding.

Feb 16, 2026 — Inaugural Meeting (Year of the Horse 🐴)

Discussed the hidden states extraction RFC (#33118), a prerequisite for training EAGLE-style draft heads without model-specific hooks.
Reviewed an OOM-handling bug surfaced during testing with Qwen3 30B Coder: #34643.