LLM Native Primitives: Next Golden Path
“Primitives are the raw parts or the most foundational-level building blocks for software developers. They’re indivisible (if they can be functionally split into two they must) and they do one thing really well. They’re meant to be used together rather than as solutions in and of themselves. And, we’ll build them for maximum developer flexibility. We won’t put a bunch of constraints on primitives to guard against developers hurting themselves. Rather, we’ll optimize for developer freedom and innovation.” - 2003 AWS Vision document
The Cloud Native Computing Foundation (CNCF) turns 10 (see post); Kubernetes, Lambda, ECS, and Alexa turn 11; Bedrock and Claude turn 2. The shift from CPU-native to LLM/GPU-native applications has begun. What's the next golden path for this era?
LLMs vs Other Apps
I spent two years working on SOTA LLM inference, serving both closed-source models (Anthropic) and open-source models (Llama and DeepSeek). More recently, I have shifted my focus to model optimization and inference acceleration. The closer you get to the models themselves, the more you realize how non-trivial it is to run and optimize them effectively.
Red Hat AI's tweet thread, "LLM inference is too slow, too expensive, and too hard to scale," provided an in-depth description of those challenges.
Table: LLM app workloads vs. other app workloads
Peter DeSantis gave an excellent keynote at re:Invent 2024 that highlighted: "AI workloads are scale-up workloads. Larger models demand more compute, and scientific and algorithmic limits mean simple scale-out doesn't work."
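To make "scale up" concrete, here is a back-of-the-envelope sketch of the memory a single large model needs at serving time. The dimensions (a Llama-70B-like shape with FP16 weights and grouped-query attention) and the batch/sequence sizes are illustrative assumptions, not measurements:

```python
# Rough arithmetic for why large-model inference is a scale-up problem.
# All shapes and sizes below are illustrative assumptions (Llama-70B-like),
# not benchmarks.

def weight_gib(params_b: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights in GiB (FP16 => 2 bytes per parameter)."""
    return params_b * 1e9 * bytes_per_param / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GiB: 2 (K and V) * layers * kv_heads * head_dim per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 2**30

weights = weight_gib(70)                               # ~130 GiB of weights alone
kv = kv_cache_gib(layers=80, kv_heads=8, head_dim=128,
                  seq_len=8192, batch=32)              # ~80 GiB of KV cache

print(f"weights ~= {weights:.0f} GiB, KV cache ~= {kv:.0f} GiB")
# Neither number fits on a single 80 GiB GPU, so the model has to be sharded
# (tensor/pipeline parallel) across a tightly coupled node: scale up,
# not just more independent replicas.
```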
Last Mile Delivery
Lately, I’ve seen a flourishing of open-source large-model inference frameworks (like llm-d, Dynamo, and SGLang), caching projects such as LMCache and Mooncake, and engines such as TensorRT and vLLM. Looking back to early 2023, when running a GPU felt like lugging bricks, it’s clear that LLM inference and optimization are now starting to coalesce around some fuzzy but emerging industry-standard paradigms. As someone who’s been hands-on with large-model serving for two years, I find this really encouraging.
Although this area doesn’t have the academic depth or community influence of blogs from respected researchers like Lilian Weng or Shunyu Yao, and it lacks the accessibility and popularity of agent-level applications, it still plays a vital role. It’s hard to classify: it's not the model itself, not the application layer, not infrastructure, and not hardware either. Its position is perhaps closest to that of MCP relative to agents, or Kubernetes relative to CPUs. Sitting between the model and the hardware, you could call it the 'middleware layer.'
This middleware layer aims to tackle the real bottlenecks that stand in the way of deploying LLMs at scale: “LLM inference is too slow, too expensive, and too hard to scale.”
These are ‘last mile’ problems: making LLMs run efficiently, reliably, and elastically, and turning model capabilities into real user value. What’s more, nearly every project in this space is choosing open-source collaboration as the path forward, which is incredibly encouraging for those who are passionate about contributing to open-source communities. It feels a lot like the early days of public cloud 10 years ago: from CPU → virtualization → VM → containers → serverless.
The LLM/GPU-native era has begun.
Next Golden Path
At Amazon, there's an internal concept known as 'the golden path': a set of opinionated, best-practice tooling, architecture, and configuration recommendations for the end-to-end builder experience, with popular options like AWS Lambda and ECS/Fargate. Now that CPU-native apps and platforms have passed the decade mark, what's the golden path for LLM-native applications?
Communities appear to have already drawn a few lines:
...

Figure: Prefill and Decode Phases of the Transformer
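To ground the figure, here is a minimal sketch of the two phases using Hugging Face Transformers with greedy decoding; the model name and prompt are illustrative stand-ins, and any small causal LM behaves the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt_ids = tok("The next golden path for LLM-native apps is",
                 return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over the whole prompt. Compute-bound,
    # and it populates the KV cache for every prompt token at once.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    for _ in range(16):
        # Decode: one token per forward pass. Memory-bandwidth-bound,
        # reusing and extending the KV cache instead of recomputing the prompt.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat([prompt_ids] + generated, dim=-1)[0]))
```

Prefill and decode have very different hardware profiles, which is a big part of why projects like vLLM, SGLang, llm-d, and Dynamo invest so heavily in KV-cache management and prefill/decode disaggregation.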
Rather than highlight one or two specific software solutions while waiting for the next Kubernetes moment to arrive, I prefer to talk about a new working model in which Product, Engineering, and Research collaborate more closely than ever before. This kind of integration, unlike anything we've seen in the past, is just beginning to take shape. vLLM is a representative example. And the future remains wide open.