MLSys 2026: Speculative Decoding, Invited Talks & What's Next

Created in May 2026 · 2,113 words · 11 min read

2026 · mlsys speculative-decoding llm · research

MLSys 2026 (May 18–21, Bellevue, WA) just wrapped. This is my post-conference reading guide: the speculative-decoding papers worth your time, grouped by theme; the invited-talk lineup; and a step back to consider where the next innovation cycle is heading.

Disclaimer: This is a personal reading guide from an individual builder's perspective. All groupings, summaries, and opinions are my own; headline numbers are quoted from the papers' own abstracts.

Speculative Decoding @ MLSys 2026

Speculative decoding (SD) is one of the loudest themes at MLSys 2026 — nine papers across the program, and the conversation has clearly shifted. It's no longer "does draft-then-verify work?" but "where does it actually pay off, and where is it an illusion?" Below I group the papers by the problem they attack. Each links to its MLSys page and arXiv preprint.

Note: SemiAnalysis's MLSys 2026 preview flags the same currents from the analyst side — speculative decoding for RL's long-tail rollouts, sparse attention moving into production, hardware/software co-design via disaggregation, and MoE-serving challenges.

All nine are Research Track papers. Per the MLSys 2026 calendar, the orals split across two sessions — LLM Training 2 (Wed, May 20) and LLM Serving 5 (Thu, May 21) — so I tag each below with its track and whether it targets LLM TRAINING or LLM SERVING.
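For readers new to the idea, here is a minimal sketch of one draft-then-verify step, assuming greedy decoding on both sides. The `draft` and `target` callables and the toy models are hypothetical stand-ins for real models, and none of this is taken from any of the nine papers. A real engine verifies all drafted positions in a single batched target forward pass; the per-position loop below is only for readability.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Return the tokens appended by one draft-then-verify step."""
    # 1) Draft k tokens autoregressively with the cheap model.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2) Verify left to right; stop at the first disagreement.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(tok)

    # All k drafts accepted: the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[Token]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


# Toy models: the target counts modulo 10; the drafter is wrong after a 4.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

With greedy acceptance the output matches plain target decoding token for token; the only thing that changes is how many target steps it takes, which is why acceptance length and verification cost dominate the discussion in the papers below.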

1 · RL + Speculative Decoding

The rollout/generation stage now dominates RL post-training time. Both papers retarget SD at the RL loop — where the drafter and policy are both moving.

ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems · ORAL · RESEARCH · LLM TRAINING

Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, Tianwei Zhang

Generation can eat 75%+ of RL training time. ReSpec tackles the three ways naive SD breaks in RL — vanishing gains at large batch sizes, drafter staleness as the actor updates, and drafter-induced policy degradation — via dynamic SD config tuning, drafter evolution by distillation, and reward-weighted updates.

up to 4.5× speedup, reward convergence preserved

MLSys · arXiv:2510.26475 · PDF

Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training (DAS) · ORAL · RESEARCH · LLM TRAINING

Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang

A small fraction of long rollouts dominates wall-clock time. DAS drops the neural drafter for an adaptive, nonparametric one built from recent rollouts via an incrementally maintained suffix tree, plus a length-aware policy that spends more speculation budget on the long trajectories that set the makespan — without changing model outputs.

up to 50% rollout-time reduction, identical training curves

MLSys · arXiv:2511.13841 · PDF · author's note

Co-author Yiying Zhang's write-up (with Together AI) reports the split: 50% faster rollout on math RL and 25% on code RL. The trick is exploiting recurring prompt patterns across training epochs, which matters in RL because the weights keep moving, unlike fixed-model serving.
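To make the nonparametric drafter concrete, here is a rough sketch that assumes a plain n-gram index in place of the incrementally maintained suffix tree the paper describes. The class name, the `draft_budget` helper, and its thresholds are all hypothetical illustrations, not DAS's actual implementation. It shows the two ideas from the summary: drafting by matching against recent rollouts, and giving long trajectories a larger speculation budget.

```python
from typing import Dict, List, Tuple


class RecentRolloutDrafter:
    """Proposes draft tokens by matching the context's suffix against
    previously seen rollouts (n-gram index as a suffix-tree stand-in)."""

    def __init__(self, max_suffix: int = 4) -> None:
        self.max_suffix = max_suffix
        # n-gram -> (rollout, position right after the n-gram); latest wins.
        # Eviction of stale rollouts is omitted in this sketch.
        self.index: Dict[Tuple[int, ...], Tuple[List[int], int]] = {}

    def add_rollout(self, tokens: List[int]) -> None:
        for n in range(1, self.max_suffix + 1):
            # Stop one short so every indexed n-gram has a continuation.
            for i in range(len(tokens) - n):
                self.index[tuple(tokens[i:i + n])] = (tokens, i + n)

    def propose(self, context: List[int], k: int) -> List[int]:
        # Prefer the longest matching suffix of the current context.
        for n in range(min(self.max_suffix, len(context)), 0, -1):
            hit = self.index.get(tuple(context[-n:]))
            if hit is not None:
                rollout, pos = hit
                return rollout[pos:pos + k]
        return []  # no match: fall back to ordinary decoding


def draft_budget(generated_len: int, base_k: int = 2,
                 max_k: int = 8, long_after: int = 512) -> int:
    """Hypothetical length-aware policy: long trajectories draft more."""
    return max_k if generated_len >= long_after else base_k
```

Because proposals are still verified by the policy model, a stale or wrong match costs only wasted verification work and never changes the sampled output.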

2 · Self-Speculative Decoding

One model is both drafter and verifier — no separate draft checkpoint to train or serve.

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding (SpecGen) · ORAL · RESEARCH · LLM SERVING

Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang, Chaofan Lin, Jongseok Park, Guangxuan Xiao, Mohamed Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica

Long chain-of-thought makes reasoning inference memory-bound. PillarAttn drafts using only the critical tokens (those with the highest attention scores) and reuses the attention scores already computed in the verification stage. It is paired with unified draft/verify scheduling, delayed verification for overlap, and dynamic KV-cache management. Lossless and training-free.

up to 2.13× throughput over prior SOTA

MLSys · arXiv:2512.01278 · PDF
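The "critical tokens" idea can be illustrated with a small pure-Python sketch, assuming single-head attention for one query and a fixed top-k selection. The function names and the top-k rule are my own simplification for illustration and should not be read as the paper's kernels or selection policy. The dense pass plays the role of verification and yields per-token scores; the sparse pass plays the role of drafting and reads only the selected KV entries.

```python
import math
from typing import List, Sequence


def attention_weights(query: Sequence[float], keys: Sequence[Sequence[float]],
                      idx: Sequence[int]) -> List[float]:
    """Softmax attention weights of one query over the positions in idx."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float], keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]], idx: Sequence[int]) -> List[float]:
    """Attention output restricted to the positions in idx."""
    idx = list(idx)
    w = attention_weights(query, keys, idx)
    dim = len(values[0])
    return [sum(wi * values[i][d] for wi, i in zip(w, idx)) for d in range(dim)]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Positions of the k highest scores, returned in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy cache: positions 0 and 2 carry almost all of the attention mass.
query = [1.0, 0.0]
keys = [[10.0, 0.0], [0.0, 0.0], [10.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

all_pos = range(len(keys))
verify_scores = attention_weights(query, keys, all_pos)  # from the dense pass
critical = top_k_indices(verify_scores, k=2)             # reused for drafting
dense_out = attend(query, keys, values, all_pos)
sparse_out = attend(query, keys, values, critical)
```

In this toy case the sparse output is within a fraction of a percent of the dense one while reading half the cache, which is the memory-traffic saving that matters when decoding is memory-bound.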

Accelerating LLM Inference: Self-Speculative Decoding via Learned Seed Injection · POSTER · RESEARCH · LLM SERVING

Anuradha Pandey

A self-speculative approach that injects a learned "seed" to drive the model's own draft path. (Poster; preprint not yet located at time of writing.)

MLSys

3 · Benchmarking & Reality Checks

Speculative Decoding: Performance or Illusion? · ORAL · RESEARCH · LLM SERVING

Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

The first systematic study of SD on a production-grade engine (vLLM) — five variants (n-gram, EAGLE/EAGLE-3, draft-model, MTP) over four models and six workloads at batch sizes 1–128. The punchline: SD works, but gains shrink as batch grows and the system turns compute-bound; target-model verification dominates execution, and there's a large gap between observed and theoretical upper bounds.

MLSys · arXiv:2601.11580 · PDF · benchmark site
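As a back-of-the-envelope companion to the "theoretical upper bound" point, the sketch below uses the textbook i.i.d.-acceptance model of speculative decoding, not the paper's own bound. The assumptions: each drafted token is accepted independently with probability `alpha`, one draft step costs `c` target steps, and verifying `k + 1` positions costs `verify_cost` target steps. The `verify_cost` parameter is my own addition to show what happens when verification stops being free.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify step:
    1 + alpha + alpha^2 + ... + alpha^k (geometric series)."""
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def ideal_speedup(alpha: float, k: int, c: float,
                  verify_cost: float = 1.0) -> float:
    """Speedup over plain decoding under the assumptions above.
    verify_cost = 1.0 is the memory-bound ideal (verification as cheap
    as a single decode step); larger values model compute-bound batches."""
    return expected_tokens_per_step(alpha, k) / (k * c + verify_cost)
```

For example, `ideal_speedup(0.8, 4, 0.05)` is about 2.8×, but `ideal_speedup(0.8, 4, 0.05, verify_cost=3.0)` is about 1.05×. The same acceptance rate yields almost no gain once verifying five positions costs three decode steps, which is the large-batch regime the paper measures.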

4 · Drafter Design & New Workloads

Pushing SD onto new drafter architectures (diffusion), new target families (MoE), and new hardware (the laptop NPU).

SpecDiff-2: Scaling Diffusion Drafter Alignment for Faster Speculative Decoding · ORAL · RESEARCH · LLM SERVING

Jameson Sandler, Jacob K. Christopher, Tom Hartvigsen, Ferdinando Fioretto

Uses a discrete-diffusion, non-autoregressive drafter to kill the sequential dependency in drafting, plus new techniques to calibrate the diffusion drafter against the autoregressive verifier (reducing rejections).

+55% avg tokens/sec, up to 5.5× over standard decoding, no accuracy loss

MLSys · arXiv:2511.00606 · PDF

PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models · ORAL · RESEARCH · LLM SERVING

Xuliang Wang, Yuetao Chen, Maochan Zhen, Fang Liu, Xinzhou Zheng, Xingwu Liu, Hong Xu, Ming Li

Disaggregates each predictive step across different parameter sets, decoupling drafter capacity from inference cost. The result is longer accepted draft sequences at minimal draft latency.

>2.6× decoding throughput on an already-optimized engine

MLSys · arXiv:2602.01762 · PDF

Cascade: Utility-Driven Speculative Decoding for Mixture-of-Experts · POSTER · RESEARCH · LLM SERVING

Anish Saxena, Po-An Tsai, Hritvik Taneja, Aamer Jaleel, Moinuddin Qureshi

In MoEs, draft tokens collectively activate more experts, inflating verification cost 2–3×, so naive SD can slow decoding down by up to 1.5×. Cascade defines a "speculation utility" metric (token gain / verification cost) and exploits its iteration-level locality to switch speculation on or off and tune the draft length K.

slowdown capped at 5% (vs 1.5×), +7–14% throughput over static K

MLSys · arXiv:2506.20675 · PDF
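A utility-driven controller of this kind might look roughly like the sketch below. It assumes utility is measured as tokens gained per iteration divided by the relative iteration cost, with K = 0 meaning speculation is off. The function names, the measurement dictionary, and the selection rule are hypothetical and only illustrate the ratio described above, not Cascade's actual policy.

```python
from typing import Dict, Tuple


def speculation_utility(tokens_per_iter: float, iter_time: float,
                        baseline_iter_time: float) -> float:
    """Token gain divided by relative cost. Plain decoding emits one token
    per iteration at baseline cost, so its utility is 1.0 by definition."""
    token_gain = tokens_per_iter / 1.0
    relative_cost = iter_time / baseline_iter_time
    return token_gain / relative_cost


def choose_k(recent: Dict[int, Tuple[float, float]],
             baseline_iter_time: float) -> int:
    """Pick the draft length with the best recently observed utility.

    recent maps K -> (avg tokens per iteration, avg iteration time) from
    the last few iterations; relying on it assumes utility changes slowly
    between nearby iterations. Returns 0 (speculation off) when no K
    beats plain decoding.
    """
    best_k, best_u = 0, 1.0
    for k, (tokens, iter_time) in recent.items():
        u = speculation_utility(tokens, iter_time, baseline_iter_time)
        if u > best_u:
            best_k, best_u = k, u
    return best_k
```

For instance, `choose_k({2: (1.8, 1.5), 4: (2.4, 3.0)}, 1.0)` picks K = 2 because its utility is about 1.2, while K = 4 accepts more tokens but at 3× the iteration cost, giving a utility of 0.8.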

SD-HC: Heterogeneous Functional Pipelining for Speculative LLM Decoding on AI PCs · POSTER · RESEARCH · LLM SERVING

Xikai (Noah) Meng, Chao Li, Spandan Tiwari

Brings SD to the laptop "AI PC," pipelining draft and verify across the heterogeneous CPU/GPU/NPU units of consumer silicon. (Poster; preprint not yet located at time of writing.)

MLSys

The arc across these nine: SD has graduated from a single trick into a design space. The frontier moved from inference latency (the original use) into RL training throughput (ReSpec, DAS); the drafter went from a small LM to a suffix tree (DAS), a diffusion model (SpecDiff-2), or the model itself (SpecGen); and "Performance or Illusion?" is the field's honest mirror: at large batch sizes the verifier dominates and the easy wins are gone.

Invited Talks

The invited-talk lineup reads like a map of where systems people think the field is going — less about a single kernel, more about co-design across the whole stack, and a recurring meta-theme: AI is now writing the systems we used to write by hand.

Keynote — Amin Vahdat (SVP & Chief Technologist, AI & Infrastructure, Google)

The Next Horizon of Systems: From MLSys to System Intelligence — Lidong Zhou (Chief Scientist, Microsoft Asia-Pacific R&D). Argues for "system intelligence," where AI reshapes the systems discipline itself — new forms of reasoning, design, validation, and evolution.

The Path to Inference Efficiency — Christos Kozyrakis (NVIDIA / Stanford). Big gains come not from isolated tricks but from treating hardware, systems, and models as one integrated stack.

Rethinking Pretraining: Data and Architecture — Luke Zettlemoyer (UW / Meta). Nearly all advanced capabilities ultimately trace back to the pretraining data.

Beyond Model Serving: Cross-Stack Co-Design for Agentic Systems — Esha Choukse (Microsoft Azure Research). Treat accuracy and quality as dynamic system-level quantities to trade against latency, cost, and energy.

When AI Starts Writing Systems Code — Mark Saroufim (GPU MODE, ex-Meta). Kernel LLMs for GPU optimization — and how to make AI-generated kernels production-ready.

Rethinking Open Source Contribution in the Age of AI Agents — Roger Wang (vLLM core maintainer). As AI-generated PRs flood in, humans must focus on understanding systems, picking the right problems, and owning what ships.

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — Yuhan Liu (UChicago). Open-source KV-cache reuse across queries and engines — up to 15× throughput.

Eliciting Language Model Behaviors with Investigator Agents — Xiang Lisa Li (OpenAI; incoming UW). Investigator agents that search for prompts inducing specific target behaviors.

What's Next — The Innovation Cycle

Step back from any single paper. Over the past decade I've lived through three big technology cycles as a builder — and the speculative-decoding sprint above is just the current one in motion. The diagram below shows them two ways.

Pattern 1 — hype waves (peaks & troughs). Each technology has its own crest in collective attention. Cloud Native took off first — Kubernetes was open-sourced in 2014, the CNCF formed in 2015, and it became the default substrate by ~2017–2019, then plateaued rather than fading. NLP / Conversational AI Understanding crested around 2019–2020 (the chatbot and voice-assistant era), then got absorbed into something bigger. LLM / AI Native looks like it appeared overnight with ChatGPT's launch on November 30, 2022 — but the curve was never zero. It sits on decades of reinforcement-learning foundations (Barto & Sutton's work from the 1980s, recognized with the 2024 Turing Award) and the 2017 Transformer (“Attention Is All You Need”). You can even see the early tremors in the curve: AlphaGo's defeat of Lee Sedol (2016), AlphaZero (2017), and AlphaStar / OpenAI Five (2019) were each a small public crest of RL that rose and receded. What changed in late 2022 was the vertical takeoff — RLHF aligned the models and ChatGPT put them in everyone's hands — and unlike the first two, its curve is still climbing in 2026.

Pattern 2 — adoption eras (the flat steps). Strip out the hype and look at what I actually built and shipped as a builder, and you get three parallel plateaus, each higher than the last:

2019–2020: NLP. Building voice AI in the health and wellness domain (intent, slots, and conversational understanding) before the transformer ate everything.
2020–2022: Container & Serverless. Shipping on AWS Fargate and incubating AWS App Runner: virtualizing and abstracting the compute layer into a unit of work as the platform matured.
2023–now: LLM (ongoing). AI-native and agentic systems. When Amazon Bedrock went GA in 2023, I helped found the Anthropic Claude model inference engine on Bedrock, partnering directly with Anthropic (co-founder Ben Mann). That work touched on several ideas that are now standard practice across the open-source community: disaggregated inference, multi-node inference, context-aware routing, and prompt caching. Through 2024, I led almost all Claude models' Day-0 public releases on Bedrock, including Computer Use, Anthropic's first agentic-AI feature, on Claude 3.5 v2. I also delivered the first Claude model release on the AWS Trainium/Inferentia platform and grew it to carry ~90% of Bedrock traffic on Trainium chips, partnering with James Bradbury (Anthropic's Head of Compute) and the Annapurna teams. In 2025, I shifted to open-weights model optimization, covering post-training (fine-tuning) and inference optimization (training EAGLE draft models for speculative decoding, quantization, and kernel tuning) across both SageMaker AI and Bedrock. The open arrow is the point: this era hasn't peaked, which is exactly why the MLSys 2026 program looks the way it does.

The heights aren't arbitrary — each cycle absorbed and stood on the previous one. NLP's understanding problem became a sub-task of LLMs; cloud-native infrastructure became the substrate LLMs are served and sandboxed on. You can even read it in the institutions: the Linux Foundation has carried the baton across the right half of this chart — from the CNCF (cloud native) to the PyTorch Foundation and now the Agentic AI Foundation (AAIF). The ladder only goes up. A concrete example of the throughline: Firecracker — the microVM born in that cloud-native era — is now actively adopted as a sandbox runtime for agentic-AI compute.

So what's the next step on the ladder? The bet from San Francisco points past the screen and into the physical world: robotics, world models, and embodied AI — systems that don't just generate tokens but perceive, predict, and act.

But here's the honest catch, and it's why this plateau hasn't crystallized yet: the definition is still wide open. “World model” is a Hamlet — 一千个人眼里有一千个哈姆雷特 (a thousand people, a thousand Hamlets). Ask ten builders what it means and you'll get ten answers: a humanoid? a self-driving stack? a world model you can query? an agent with an API and a gripper?

Jun 3, 2026 · Dr. Fei-Fei Li shared her definition of a world model in the article "A Functional Taxonomy of World Models".

Each of the previous three cycles eventually crystallized around a clear unit of work — an intent, a container, a token. The next one won't truly take off until it finds its unit. Until then, the MLSys-style work above — efficient inference, hardware/software co-design, self-optimizing infrastructure — is exactly what lays the track for whichever definition ends up winning.

Co-authored with Claude.
