MLSys 2026: Speculative Decoding, Invited Talks & What's Next

MLSys 2026 (May 18–21, Bellevue, WA) just wrapped. This is my post-conference reading guide: the speculative-decoding papers worth your time grouped by theme, the invited-talk lineup, and a step back to think about where the next innovation cycle is heading.

Disclaimer: This is a personal reading guide from an individual builder's perspective. All groupings, summaries, and opinions are my own; headline numbers are quoted from the papers' own abstracts.




Speculative Decoding @ MLSys 2026

Speculative decoding (SD) is one of the loudest themes at MLSys 2026 — nine papers across the program, and the conversation has clearly shifted. It's no longer "does draft-then-verify work?" but "where does it actually pay off, and where is it an illusion?" Below I group the papers by the problem they attack. Each links to its MLSys page and arXiv preprint.

Note: SemiAnalysis's MLSys 2026 preview flags the same currents from the analyst side — speculative decoding for RL's long-tail rollouts, sparse attention moving into production, hardware/software co-design via disaggregation, and MoE-serving challenges.

All nine are Research Track papers. Per the MLSys 2026 calendar, the orals split across two sessions, LLM Training 2 (Wed, May 20) and LLM Serving 5 (Thu, May 21), so I tag each paper below with its presentation format (ORAL or POSTER), its track, and whether it targets LLM TRAINING or LLM SERVING.
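For readers new to the draft-then-verify loop that all nine papers build on, here is a minimal sketch of one greedy speculative step. It is a toy under stated assumptions, not any paper's implementation: `draft` and `target` are hypothetical callables that return the greedy next token for a context, and verification is written as a Python loop where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical greedy next-token function


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """One greedy draft-then-verify step; returns the newly committed tokens."""
    # 1) Draft k tokens cheaply, one after another.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2) Verify: keep the longest run of drafted tokens the target agrees with.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(tok)

    # 3) Every draft was accepted, so the target adds one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]
```

Because every committed token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone. The draft model only changes how many target passes are needed, which is why acceptance rate and verification cost decide whether SD pays off.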

1 · RL + Speculative Decoding

The rollout/generation stage now dominates RL post-training time. Both papers retarget SD at the RL loop — where the drafter and policy are both moving.

ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems ORAL RESEARCH LLM TRAINING
Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, Tianwei Zhang

Generation can eat 75%+ of RL training time. ReSpec tackles the three ways naive SD breaks in RL — vanishing gains at large batch sizes, drafter staleness as the actor updates, and drafter-induced policy degradation — via dynamic SD config tuning, drafter evolution by distillation, and reward-weighted updates.

up to 4.5× speedup, reward convergence preserved

Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training (DAS) ORAL RESEARCH LLM TRAINING
Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang

A small fraction of long rollouts dominates wall-clock time. DAS drops the neural drafter for an adaptive, nonparametric one built from recent rollouts via an incrementally maintained suffix tree, plus a length-aware policy that spends more speculation budget on the long trajectories that set the makespan — without changing model outputs.

up to 50% rollout-time reduction, identical training curves

Co-author Yiying Zhang's write-up (with Together AI) reports the split: 50% faster rollout on math RL and 25% on code RL. The trick is exploiting recurring prompt patterns across training epochs: because the policy weights keep moving during RL, unlike in fixed-model serving, a drafter rebuilt from recent rollouts stays current where a fixed one would go stale.
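To make the nonparametric-drafter idea concrete, here is a small sketch of drafting by longest-suffix match over recent rollouts, plus a length-aware budget. It is an illustration only: a dictionary of n-gram contexts stands in for the paper's incrementally maintained suffix tree, and `SuffixDrafter`, `speculation_budget`, and all thresholds are hypothetical names and values, not the authors' code.

```python
from typing import Dict, List, Tuple


class SuffixDrafter:
    """Toy nonparametric drafter: longest-suffix match over recent rollouts.

    Assumption: a dict keyed by n-gram context stands in for a suffix tree.
    """

    def __init__(self, max_context: int = 4) -> None:
        self.max_context = max_context
        # context n-gram -> (rollout, position where its continuation starts)
        self.index: Dict[Tuple[int, ...], Tuple[List[int], int]] = {}

    def add_rollout(self, tokens: List[int]) -> None:
        """Index every n-gram (n <= max_context); newer rollouts overwrite older ones."""
        for end in range(1, len(tokens)):
            for n in range(1, min(self.max_context, end) + 1):
                self.index[tuple(tokens[end - n:end])] = (tokens, end)

    def propose(self, context: List[int], budget: int) -> List[int]:
        """Return up to `budget` draft tokens that followed the longest matching suffix."""
        for n in range(min(self.max_context, len(context)), 0, -1):
            hit = self.index.get(tuple(context[-n:]))
            if hit is not None:
                tokens, start = hit
                return tokens[start:start + budget]
        return []  # no match: fall back to ordinary decoding


def speculation_budget(generated_so_far: int, base: int = 2,
                       long_after: int = 512, long_budget: int = 8) -> int:
    """Hypothetical length-aware policy: long trajectories get a larger draft budget."""
    return long_budget if generated_so_far >= long_after else base


drafter = SuffixDrafter(max_context=3)
drafter.add_rollout([1, 2, 3, 4, 5, 6])
print(drafter.propose([9, 2, 3], budget=2))  # prints [4, 5]
```

The drafts still go through target verification, so a stale or wrong match costs some wasted compute but never changes the model's output.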

2 · Self-Speculative Decoding

One model is both drafter and verifier — no separate draft checkpoint to train or serve.

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding (SpecGen) ORAL RESEARCH LLM SERVING
Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang, Chaofan Lin, Jongseok Park, Guangxuan Xiao, Mohamed Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica

Long chain-of-thought makes reasoning inference memory-bound. The paper's PillarAttn mechanism drafts with attention over only the critical tokens (those with the highest attention scores) and reuses the attention scores already computed in the verification stage rather than recomputing them. It is paired with unified draft/verify scheduling, delayed verification for overlap, and dynamic KV-cache management. Lossless and training-free.
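The critical-token selection described above can be sketched as a top-k pick over attention scores. This is a simplified illustration, not the paper's kernel: `select_critical_tokens` is a hypothetical helper, and it assumes the scores from the last verification pass have already been aggregated into one number per KV position.

```python
from typing import List


def select_critical_tokens(attn_scores: List[float], k: int) -> List[int]:
    """Return the k KV positions with the highest attention scores, in position order.

    attn_scores[i] is the attention mass position i received during the last
    verification pass (assumed already aggregated over heads and layers).
    """
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:k])


# Scores left over from a verification pass are reused to choose the draft's
# sparse KV set, so drafting reads 3 cache entries here instead of 6.
scores_from_verify = [0.05, 0.40, 0.02, 0.30, 0.01, 0.22]
print(select_critical_tokens(scores_from_verify, k=3))  # prints [1, 3, 5]
```

Reading fewer KV entries per draft step is what relieves the memory-bound bottleneck. Full-attention verification then restores exactness.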

up to 2.13× throughput over prior SOTA

Accelerating LLM Inference: Self-Speculative Decoding via Learned Seed Injection POSTER RESEARCH LLM SERVING
Anuradha Pandey

A self-speculative approach that injects a learned "seed" to drive the model's own draft path. (Poster; preprint not yet located at time of writing.)

3 · Benchmarking & Reality Checks

Speculative Decoding: Performance or Illusion? ORAL RESEARCH LLM SERVING
Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

The first systematic study of SD on a production-grade engine (vLLM) — five variants (n-gram, EAGLE/EAGLE-3, draft-model, MTP) over four models and six workloads at batch sizes 1–128. The punchline: SD works, but gains shrink as batch grows and the system turns compute-bound; target-model verification dominates execution, and there's a large gap between observed and theoretical upper bounds.
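A back-of-the-envelope model helps show why gains shrink once verification stops being cheap. The sketch below uses the standard textbook estimate for speculative decoding under the assumption that each drafted token is accepted independently with probability `alpha`. It is not the paper's analysis, and `alpha`, `draft_cost`, and `verify_cost` are illustrative parameters, with costs expressed relative to one ordinary target decode step.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per verify step with k drafts.

    Assumes i.i.d. acceptance with rate alpha: (1 - alpha^(k+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Tokens per step divided by the cost of k draft passes plus one verify pass."""
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + verify_cost)


# Memory-bound regime (small batch): verifying k+1 tokens costs about one decode step.
print(round(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05, verify_cost=1.0), 2))
# Compute-bound regime (large batch): verification cost approaches k+1 decode steps.
print(round(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05, verify_cost=5.0), 2))
```

With the same acceptance rate, the first setting estimates a speedup well above 1 and the second falls below 1. Under these assumptions, that is the batch-size effect in miniature: the drafter does not get worse, the verifier gets more expensive.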

4 · Drafter Design & New Workloads

Pushing SD onto new drafter architectures (diffusion), new target families (MoE), and new hardware (the laptop NPU).

SpecDiff-2: Scaling Diffusion Drafter Alignment for Faster Speculative Decoding ORAL RESEARCH LLM SERVING
Jameson Sandler, Jacob K. Christopher, Tom Hartvigsen, Ferdinando Fioretto

Uses a discrete-diffusion, non-autoregressive drafter to kill the sequential dependency in drafting, plus new techniques to calibrate the diffusion drafter against the autoregressive verifier (reducing rejections).

+55% avg tokens/sec, up to 5.5× over standard decoding, no accuracy loss

PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models ORAL RESEARCH LLM SERVING
Xuliang Wang, Yuetao Chen, Maochan Zhen, Fang Liu, Xinzhou Zheng, Xingwu Liu, Hong Xu, Ming Li

Disaggregates each predictive step across different parameter sets, decoupling drafter capacity from inference cost — longer accepted draft sequences at minimal draft latency.

>2.6× decoding throughput on an already-optimized engine

Cascade: Utility-Driven Speculative Decoding for Mixture-of-Experts POSTER RESEARCH LLM SERVING
Anish Saxena, Po-An Tsai, Hritvik Taneja, Aamer Jaleel, Moinuddin Qureshi

In MoEs, draft tokens collectively activate more experts, inflating verification cost 2–3× — so naive SD can slow things down up to 1.5×. Cascade defines a "speculation utility" metric (token gain / verification cost) and uses its iteration-level locality to switch speculation on/off and tune the draft length K.

slowdown capped at 5% (vs 1.5×), +7–14% throughput over static K

SD-HC: Heterogeneous Functional Pipelining for Speculative LLM Decoding on AI PCs POSTER RESEARCH LLM SERVING
Xikai (Noah) Meng, Chao Li, Spandan Tiwari

Brings SD to the laptop "AI PC," pipelining draft and verify across the heterogeneous CPU/GPU/NPU units of consumer silicon. (Poster; preprint not yet located at time of writing.)

The arc across these nine: SD has graduated from a single trick into a design space. The frontier moved from inference latency (the original use) into RL training throughput (ReSpec, DAS); the drafter went from a small LM to a suffix tree (DAS), a diffusion model (SpecDiff-2), or the model itself (SpecGen); and "Performance or Illusion?" is the field's honest mirror: at large batch the verifier dominates and the easy wins are gone.




Invited Talks

The invited-talk lineup reads like a map of where systems people think the field is going — less about a single kernel, more about co-design across the whole stack, and a recurring meta-theme: AI is now writing the systems we used to write by hand.

Keynote — Amin Vahdat (SVP & Chief Technologist, AI & Infrastructure, Google)
The Next Horizon of Systems: From MLSys to System Intelligence · Lidong Zhou (Chief Scientist, Microsoft Asia-Pacific R&D). Argues for "system intelligence," where AI reshapes the systems discipline itself: new forms of reasoning, design, validation, and evolution.
The Path to Inference Efficiency · Christos Kozyrakis (NVIDIA / Stanford). Big gains come not from isolated tricks but from treating hardware, systems, and models as one integrated stack.
Rethinking Pretraining: Data and Architecture · Luke Zettlemoyer (UW / Meta). Nearly all advanced capabilities ultimately trace back to the pretraining data.
Beyond Model Serving: Cross-Stack Co-Design for Agentic Systems · Esha Choukse (Microsoft Azure Research). Treat accuracy and quality as dynamic system-level quantities to trade against latency, cost, and energy.
When AI Starts Writing Systems Code · Mark Saroufim (GPU MODE, ex-Meta). Kernel LLMs for GPU optimization, and how to make AI-generated kernels production-ready.
Rethinking Open Source Contribution in the Age of AI Agents · Roger Wang (vLLM core maintainer). As AI-generated PRs flood in, humans must focus on understanding systems, picking the right problems, and owning what ships.
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference · Yuhan Liu (UChicago). Open-source KV-cache reuse across queries and engines, with up to 15× throughput.
Eliciting Language Model Behaviors with Investigator Agents · Xiang Lisa Li (OpenAI; incoming UW). Investigator agents that search for prompts inducing specific target behaviors.



What's Next — The Innovation Cycle

Step back from any single paper. Over the past decade I've lived through three big technology cycles as a builder — and the speculative-decoding sprint above is just the current one in motion. The diagram below shows them two ways.

[Figure: Two views of three innovation cycles over 2016–2026: hype waves and stacked adoption eras for NLP/Conversational AI, Cloud Native, and LLM/AI Native.]

Pattern 1: hype waves (peaks & troughs). Each technology has its own crest in collective attention. Cloud Native took off first: Kubernetes was open-sourced in 2014, the CNCF formed in 2015, and cloud native became the default substrate by ~2017–2019, then plateaued rather than fading. NLP / Conversational AI crested around 2019–2020 (the chatbot and voice-assistant era), then got absorbed into something bigger. LLM / AI Native looks like it appeared overnight with ChatGPT's launch on November 30, 2022, but the curve was never zero. It sits on decades of reinforcement-learning foundations (Barto & Sutton's work from the 1980s, recognized with the 2024 Turing Award) and the 2017 Transformer (“Attention Is All You Need”). You can even see the early tremors in the curve: AlphaGo's defeat of Lee Sedol (2016), AlphaZero (2017), and AlphaStar / OpenAI Five (2019) were each a small public crest of RL that rose and receded. What changed in late 2022 was the vertical takeoff: RLHF aligned the models and ChatGPT put them in everyone's hands. Unlike the first two, its curve is still climbing in 2026.

Pattern 2: adoption eras (the flat steps). Strip out the hype and look at what I actually built and shipped, and you get three parallel plateaus, each higher than the last: NLP / Conversational AI, Cloud Native, and LLM / AI Native.

The heights aren't arbitrary — each cycle absorbed and stood on the previous one. NLP's understanding problem became a sub-task of LLMs; cloud-native infrastructure became the substrate LLMs are served and sandboxed on. You can even read it in the institutions: the Linux Foundation has carried the baton across the right half of this chart — from the CNCF (cloud native) to the PyTorch Foundation and now the Agentic AI Foundation (AAIF). The ladder only goes up. A concrete example of the throughline: Firecracker — the microVM born in that cloud-native era — is now actively adopted as a sandbox runtime for agentic-AI compute.

So what's the next step on the ladder? The bet from San Francisco points past the screen and into the physical world: robotics, world models, and embodied AI — systems that don't just generate tokens but perceive, predict, and act.

But here's the honest catch, and it's why this plateau hasn't crystallized yet: the definition is still wide open. “World model” is a Hamlet — 一千个人眼里有一千个哈姆雷特 (a thousand people, a thousand Hamlets). Ask ten builders what it means and you'll get ten answers: a humanoid? a self-driving stack? a world model you can query? an agent with an API and a gripper? Each of the previous three cycles eventually crystallized around a clear unit of work — an intent, a container, a token. The next one won't truly take off until it finds its unit. Until then, the MLSys-style work above — efficient inference, hardware/software co-design, self-optimizing infrastructure — is exactly what lays the track for whichever definition ends up winning.



Co-authored with Claude.
