MLSys 2026: Speculative Decoding, Invited Talks & What's Next
MLSys 2026 (May 18–21, Bellevue, WA) just wrapped. This is my post-conference reading guide: the speculative-decoding papers worth your time, grouped by theme; the invited-talk lineup; and a step back to think about where the next innovation cycle is heading.
Disclaimer: This is a personal reading guide from an individual builder's perspective. All groupings, summaries, and opinions are my own; headline numbers are quoted from the papers' own abstracts.
Speculative Decoding @ MLSys 2026
Speculative decoding (SD) is one of the loudest themes at MLSys 2026 — nine papers across the program, and the conversation has clearly shifted. It's no longer "does draft-then-verify work?" but "where does it actually pay off, and where is it an illusion?" Below I group the papers by the problem they attack. Each links to its MLSys page and arXiv preprint.
Note: SemiAnalysis's MLSys 2026 preview flags the same currents from the analyst side — speculative decoding for RL's long-tail rollouts, sparse attention moving into production, hardware/software co-design via disaggregation, and MoE-serving challenges.
All nine are Research Track papers. Per the MLSys 2026 calendar, the orals split across two sessions — LLM Training 2 (Wed, May 20) and LLM Serving 5 (Thu, May 21) — so I tag each below with its track and whether it targets LLM TRAINING or LLM SERVING.
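For readers new to the draft-then-verify loop these papers build on, here is a minimal sketch under greedy decoding. It is an illustration, not any paper's implementation: `speculative_step`, `generate`, and the toy next-token functions are hypothetical names, and the per-position verification loop stands in for the single batched forward pass a real engine would run.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """One draft-then-verify step; returns between 1 and k + 1 tokens."""
    # 1) Draft: the cheap model proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2) Verify: compare each proposal with the target's own choice.
    #    (A real engine scores all positions in one forward pass.)
    accepted: List[Token] = []
    for proposed in draft:
        expected = target(context + accepted)
        if proposed != expected:
            # First mismatch: commit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)

    # 3) Every draft matched: the target adds one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, drafter: NextToken,
             prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Repeat speculative steps until max_new tokens are produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + max_new]


# Toy models (assumptions for illustration): the target counts modulo 10,
# and the drafter agrees except right after a 4, where it guesses 0.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(generate(toy_target, toy_drafter, [0], max_new=12, k=3))
```

Because a mismatch always falls back to the target's token, the output matches plain greedy decoding with the target alone. With sampling, engines use a rejection-sampling acceptance rule instead of the equality check, which this sketch does not cover.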
1 · RL + Speculative Decoding
The rollout/generation stage now dominates RL post-training time. Both papers retarget SD at the RL loop — where the drafter and policy are both moving.
Generation can eat 75%+ of RL training time. ReSpec tackles the three ways naive SD breaks in RL — vanishing gains at large batch sizes, drafter staleness as the actor updates, and drafter-induced policy degradation — via dynamic SD config tuning, drafter evolution by distillation, and reward-weighted updates.
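To put the 75% figure in perspective, Amdahl's law gives the end-to-end gain when only the rollout stage gets faster. The sketch below is back-of-the-envelope arithmetic, not a result from the paper; `end_to_end_speedup` is a hypothetical helper and the rollout speedups are illustrative inputs.

```python
def end_to_end_speedup(rollout_fraction: float, rollout_speedup: float) -> float:
    """Amdahl's law: only the rollout share of a training step gets faster."""
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be in [0, 1]")
    if rollout_speedup <= 0.0:
        raise ValueError("rollout_speedup must be positive")
    remaining = 1.0 - rollout_fraction          # training, reward, weight sync
    return 1.0 / (remaining + rollout_fraction / rollout_speedup)


# Illustrative inputs: rollout is 75% of the step, sped up by 1.5x, 2x, 4x.
for s in (1.5, 2.0, 4.0):
    print(f"rollout {s:.1f}x faster -> step {end_to_end_speedup(0.75, s):.2f}x faster")
```

At a 75% rollout share the ceiling is 4x no matter how good the drafter is, which is one reason the non-rollout stages start to matter once SD works well.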
A small fraction of long rollouts dominates wall-clock time. DAS drops the neural drafter for an adaptive, nonparametric one built from recent rollouts via an incrementally maintained suffix tree, plus a length-aware policy that spends more speculation budget on the long trajectories that set the makespan — without changing model outputs.
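The idea of a drafter built from recent rollouts can be sketched roughly as follows. This is a simplified stand-in, not the DAS implementation: a dictionary of short n-grams replaces the incrementally maintained suffix tree, and `SuffixDrafter`, `draft_budget`, and every threshold are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


class SuffixDrafter:
    """Nonparametric drafter: proposes continuations seen in recent rollouts."""

    def __init__(self, max_order: int = 4) -> None:
        self.max_order = max_order
        # context n-gram -> counts of the token that followed it
        self.next_counts: Dict[Tuple[int, ...], Counter] = defaultdict(Counter)

    def add_rollout(self, tokens: List[int]) -> None:
        """Index every n-gram (up to max_order) with the token after it."""
        for i in range(1, len(tokens)):
            for order in range(1, min(self.max_order, i) + 1):
                key = tuple(tokens[i - order:i])
                self.next_counts[key][tokens[i]] += 1

    def _predict(self, seq: List[int]) -> Optional[int]:
        # Prefer the longest suffix of the context that was seen before.
        for order in range(min(self.max_order, len(seq)), 0, -1):
            counts = self.next_counts.get(tuple(seq[-order:]))
            if counts:
                return counts.most_common(1)[0][0]
        return None

    def draft(self, context: List[int], budget: int) -> List[int]:
        """Propose up to `budget` tokens; stop early when nothing matches."""
        out: List[int] = []
        seq = list(context)
        for _ in range(budget):
            nxt = self._predict(seq)
            if nxt is None:
                break
            out.append(nxt)
            seq.append(nxt)
        return out


def draft_budget(tokens_so_far: int, base: int = 2,
                 cap: int = 8, step: int = 256) -> int:
    """Length-aware budget (assumed heuristic): longer trajectories get more."""
    return min(cap, base + tokens_so_far // step)


drafter = SuffixDrafter()
drafter.add_rollout([1, 2, 3, 4, 5])
print(drafter.draft([9, 2, 3], budget=draft_budget(1024)))
```

The drafts still go through the target model's verification step, so a stale or wrong index costs speed rather than correctness.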
Co-author Yiying Zhang's write-up (with Together AI) reports the split: 50% faster rollout on math RL, 25% on code RL — the trick is exploiting recurring prompt patterns across training epochs, since RL weights keep moving unlike fixed-model serving.
2 · Self-Speculative Decoding
One model is both drafter and verifier — no separate draft checkpoint to train or serve.
Long chain-of-thought makes reasoning inference memory-bound. PillarAttn drafts using only the critical tokens (highest attention scores) and cleverly reuses attention scores from the verification stage, paired with unified draft/verify scheduling, delayed verification for overlap, and dynamic KV-cache management. Lossless and training-free.
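The critical-token selection step can be illustrated with a plain top-k over attention scores. This is a sketch of the general idea only: `select_critical_tokens` and `sparse_kv` are hypothetical helpers, the scores are made up, and the paper's scheduling, overlap, and KV-cache management are not modeled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_critical_tokens(attn_scores: Sequence[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens, in order."""
    budget = max(0, min(budget, len(attn_scores)))
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:budget])


def sparse_kv(kv_cache: Sequence[T], keep: Sequence[int]) -> List[T]:
    """Gather only the kept positions for the cheap drafting pass."""
    return [kv_cache[i] for i in keep]


# Scores as they might come out of the previous verification pass
# (made-up numbers); drafting then attends to 3 of the 6 cached tokens.
scores = [0.05, 0.40, 0.02, 0.30, 0.03, 0.20]
keep = select_critical_tokens(scores, budget=3)
print(keep)
print(sparse_kv(["kv0", "kv1", "kv2", "kv3", "kv4", "kv5"], keep))
```

Reusing scores the verifier already computed means the selection itself adds almost no work, and since the full-attention verify pass has the final say, the sparse draft can only affect speed.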
A self-speculative approach that injects a learned "seed" to drive the model's own draft path. (Poster; preprint not yet located at time of writing.)
3 · Benchmarking & Reality Checks
The first systematic study of SD on a production-grade engine (vLLM) — five variants (n-gram, EAGLE/EAGLE-3, draft-model, MTP) over four models and six workloads at batch sizes 1–128. The punchline: SD works, but gains shrink as batch grows and the system turns compute-bound; target-model verification dominates execution, and there's a large gap between observed and theoretical upper bounds.
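A simple analytical model helps show why gains shrink once verification stops being nearly free. The sketch below uses the standard textbook assumption that each draft token is accepted independently with probability `alpha`; it is not the paper's measurement methodology, and `draft_cost` and `verify_cost` are hypothetical knobs expressed relative to one ordinary target decode step.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per verify step with k drafts.

    Assumes i.i.d. acceptance with probability alpha, giving the
    geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, k: int,
                    draft_cost: float, verify_cost: float = 1.0) -> float:
    """Tokens gained per step divided by the relative cost of that step.

    verify_cost is about 1.0 while decoding is memory-bound (checking
    k + 1 positions costs roughly one step) and grows as large batches
    push the system toward compute-bound.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + verify_cost)


# Same drafter quality, increasingly expensive verification.
for verify_cost in (1.0, 2.0, 3.0):
    s = modeled_speedup(alpha=0.8, k=4, draft_cost=0.05, verify_cost=verify_cost)
    print(f"verify cost {verify_cost:.1f}x -> modeled speedup {s:.2f}x")
```

In this model the drafter never changes, yet the speedup falls toward 1x as `verify_cost` rises, which matches the qualitative picture of target-model verification dominating at larger batch sizes.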
4 · Drafter Design & New Workloads
Pushing SD onto new drafter architectures (diffusion), new target families (MoE), and new hardware (the laptop NPU).
Uses a discrete-diffusion, non-autoregressive drafter to kill the sequential dependency in drafting, plus new techniques to calibrate the diffusion drafter against the autoregressive verifier (reducing rejections).
Disaggregates each predictive step across different parameter sets, decoupling drafter capacity from inference cost — longer accepted draft sequences at minimal draft latency.
In MoEs, draft tokens collectively activate more experts, inflating verification cost 2–3× — so naive SD can slow things down up to 1.5×. Cascade defines a "speculation utility" metric (token gain / verification cost) and uses its iteration-level locality to switch speculation on/off and tune the draft length K.
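The utility idea can be sketched as a ratio plus a small selection rule. This is an illustration under assumptions, not Cascade's actual definition or policy: `speculation_utility` and `choose_k` are hypothetical, the baseline is taken to be one token per plain decode step, and the numbers are made up to mirror the 2–3× verification inflation mentioned above.

```python
from typing import Dict


def speculation_utility(tokens_per_step: float, step_cost: float,
                        baseline_step_cost: float) -> float:
    """Token gain divided by relative verification cost.

    Plain decoding commits one token per step at baseline_step_cost, so a
    utility above 1.0 means speculation is a net win and below 1.0 means
    it slows generation down.
    """
    token_gain = tokens_per_step / 1.0
    cost_ratio = step_cost / baseline_step_cost
    return token_gain / cost_ratio


def choose_k(recent_utility: Dict[int, float]) -> int:
    """Pick the draft length with the best recent utility; 0 disables SD.

    Relies on locality: utilities measured over the last few iterations
    are assumed to predict the next ones.
    """
    best_k, best_u = 0, 1.0  # speculation must beat plain decoding
    for k, u in recent_utility.items():
        if u > best_u:
            best_k, best_u = k, u
    return best_k


# Made-up MoE example: 2 tokens committed per step, but activating extra
# experts makes the verify step 3x as expensive as a plain decode step.
u = speculation_utility(tokens_per_step=2.0, step_cost=3.0, baseline_step_cost=1.0)
print(f"utility {u:.2f} -> {1 / u:.1f}x slower than plain decoding")
print(choose_k({2: 1.3, 4: 1.6, 6: 0.9}))   # keeps speculating with K = 4
print(choose_k({2: 0.9, 4: 0.7}))           # switches speculation off
```

A real controller would also need to re-probe occasionally after switching speculation off, since no utility is observed while it is disabled.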
Brings SD to the laptop "AI PC," pipelining draft and verify across the heterogeneous CPU/GPU/NPU units of consumer silicon. (Poster; preprint not yet located at time of writing.)
The arc across these nine: SD has graduated from a single trick into a design space. The frontier moved from inference latency (the original use) into RL training throughput (ReSpec, DAS); the drafter went from a small LM to a suffix tree (DAS), a diffusion model (SpecDiff-2), or the model itself (SpecGen); and "Performance or Illusion?" is the field's honest mirror: at large batch the verifier dominates and the easy wins are gone.
Invited Talks
The invited-talk lineup reads like a map of where systems people think the field is going — less about a single kernel, more about co-design across the whole stack, and a recurring meta-theme: AI is now writing the systems we used to write by hand.
What's Next — The Innovation Cycle
Step back from any single paper. Over the past decade I've lived through three big technology cycles as a builder — and the speculative-decoding sprint above is just the current one in motion. The diagram below shows them two ways.
Pattern 1 — hype waves (peaks & troughs). Each technology has its own crest in collective attention. Cloud Native took off first — Kubernetes was open-sourced in 2014, the CNCF formed in 2015, and it became the default substrate by ~2017–2019, then plateaued rather than fading. NLP / Conversational AI Understanding crested around 2019–2020 (the chatbot and voice-assistant era), then got absorbed into something bigger. LLM / AI Native looks like it appeared overnight with ChatGPT's launch on November 30, 2022 — but the curve was never zero. It sits on decades of reinforcement-learning foundations (Barto & Sutton's work from the 1980s, recognized with the 2024 Turing Award) and the 2017 Transformer (“Attention Is All You Need”). You can even see the early tremors in the curve: AlphaGo's defeat of Lee Sedol (2016), AlphaZero (2017), and AlphaStar / OpenAI Five (2019) were each a small public crest of RL that rose and receded. What changed in late 2022 was the vertical takeoff — RLHF aligned the models and ChatGPT put them in everyone's hands — and unlike the first two, its curve is still climbing in 2026.
Pattern 2 — adoption eras (the flat steps). Strip out the hype and look at what I actually built and shipped as a builder, and you get three parallel plateaus, each higher than the last:
- 2019–2020: NLP. Building voice AI in the health & wellness domain (intent, slots, and conversational understanding) before the transformer ate everything.
- 2020–2022: Container & Serverless. Shipping on AWS Fargate and incubating AWS App Runner: virtualizing and abstracting the compute layer into the unit of work as the platform matured.
- 2023–now: LLM (ongoing). AI-native and agentic systems. When Amazon Bedrock went GA in 2023, I helped found the Anthropic Claude model inference engine on Bedrock, partnering directly with Anthropic (co-founder Ben Mann). That work touched on a few ideas that are now well-known standards across today's open-source community: disaggregated inference, multi-node inference, context-aware routing, and prompt caching. Through 2024, I led almost all Claude models' Day-0 public releases on Bedrock, including Computer Use, Anthropic's first agentic-AI feature, on Claude 3.5 Sonnet v2. I delivered the first Claude model release on the AWS Trainium/Inferentia platform and grew it to carry ~90% of Bedrock traffic on Trainium chips, partnering with James Bradbury (Anthropic's Head of Compute) and the Annapurna teams. In 2025, I shifted to open-weights model optimization, covering post-training (fine-tuning) and inference optimization: training EAGLE draft models for speculative decoding, quantization, and kernel tuning, across both SageMaker AI and Bedrock. The open arrow is the point: this era hasn't peaked, which is exactly why the MLSys 2026 program looks the way it does.
The heights aren't arbitrary — each cycle absorbed and stood on the previous one. NLP's understanding problem became a sub-task of LLMs; cloud-native infrastructure became the substrate LLMs are served and sandboxed on. You can even read it in the institutions: the Linux Foundation has carried the baton across the right half of this chart — from the CNCF (cloud native) to the PyTorch Foundation and now the Agentic AI Foundation (AAIF). The ladder only goes up. A concrete example of the throughline: Firecracker — the microVM born in that cloud-native era — is now actively adopted as a sandbox runtime for agentic-AI compute.
So what's the next step on the ladder? The bet from San Francisco points past the screen and into the physical world: robotics, world models, and embodied AI — systems that don't just generate tokens but perceive, predict, and act.
But here's the honest catch, and it's why this plateau hasn't crystallized yet: the definition is still wide open. “World model” is a Hamlet: as the Chinese saying goes, a thousand readers see a thousand Hamlets. Ask ten builders what it means and you'll get ten answers: a humanoid? a self-driving stack? a world model you can query? an agent with an API and a gripper? Each of the previous three cycles eventually crystallized around a clear unit of work: an intent, a container, a token. The next one won't truly take off until it finds its unit. Until then, the MLSys-style work above (efficient inference, hardware/software co-design, self-optimizing infrastructure) is exactly what lays the track for whichever definition ends up winning.
Co-authored with Claude.