CNCF 10 Years: CPU Native vs GPU Native

MAKE CLOUD NATIVE UBIQUITOUS




CNCF (Cloud Native Computing Foundation) turns 10. I published a post when Kubernetes turned 10 in 2024, and the sentiment here is the same: it's been a magical journey to be part of these programs for half of their ten-year history. I am honored and humbled to have served in various roles, such as Local Community Lead, Ambassador, Program Committee Member, and Speaker.

Motivated by Janet Kuo's presentation, 'Kubernetes at 10: A Decade of Community-Powered Innovation,' from the KuberTENes Birthday Bash at Google's Bay View office in Mountain View, along with the event's T-shirt design, here is a list of my ten KubeCons:

  • KubeCon + CloudNativeCon North America Salt Lake City, Utah 2024 (Program Committee Member)
  • KubeCon + CloudNativeCon India Delhi 2024 (Program Committee Member)
  • KubeCon + CloudNativeCon + Open Source Summit China Hong Kong 2024 (Program Committee Member)
  • KubeCon + CloudNativeCon Europe Paris, France 2024 (Program Committee Member)
  • KubeCon + CloudNativeCon North America Chicago, IL 2023 (Program Committee Member)
  • KubeCon + CloudNativeCon Europe Amsterdam, Netherlands 2023 (Program Committee Member)
  • KubeCon + CloudNativeCon North America Detroit, MI 2022 (Speaker)
  • KubeCon + CloudNativeCon North America LA, CA / Virtual 2021
  • KubeCon + CloudNativeCon North America 2020 Virtual
  • KubeCon + CloudNativeCon North America San Diego, CA 2019 (First KubeCon)

    Named as a CNCF Ambassador

    (KubeCon EU 2023 Amsterdam, Netherlands. Keukenhof, known as the Garden of Europe)

    I was first named a CNCF Ambassador in late 2022, with the announcement made public during KubeCon Europe 2023 in Amsterdam. I remember that at the time, only three Ambassadors were selected from Amazon; one of them was my mentor and colleague, a Principal Engineer in the same VP organization, and I was the only one based in Seattle. After completing my one-year term, I was reappointed as a CNCF Ambassador for another two-year term in 2024. It's a privilege to be recognized and to continue being part of the global Cloud Native community, alongside 154 fellow Ambassadors from 37 countries and 124 companies. This journey has been both unforgettable and deeply meaningful to me.


    CPU vs GPU

    LLMs have ushered in a new era of GPU-based computing, powering both model training and inference and paving the way for agentic AI. It feels as if the CPU era faded almost overnight. I've worked on both sides at Amazon as a founding engineer of such products: on the GPU side, Bedrock; and on the CPU side, quite a few, including App Runner, ECS/Fargate, Lambda, and Elastic Beanstalk.

    A few differences between GPUs (and other accelerators) and CPUs:

  • Scale out vs Scale up: AI workloads are scale-up workloads. Larger models demand more compute, and scientific and algorithmic limitations mean that simple scale-out doesn't work.
  • Compute vs Memory Bound: Large-model inference is really two workloads: prefill and token generation. Prefill needs a lot of compute to convert the input into the data structures (the KV cache) that get handed off to the next stage. Token generation then produces tokens sequentially, one at a time, which puts a very different set of demands on AI infrastructure: each time a token is generated, the entire model has to be read from memory, but only a small amount of compute is used. Token generation therefore puts heavy demand on memory but little on compute, almost the exact opposite of the prefill workload. For agentic AI scenarios, customers care about fast prefill and really fast token generation (see the toy sketch after this list).

  • Red Hat's tweets asking "Why not just scale LLMs like any other app?" provided an in-depth description of these challenges. Dynamo, an open-source inference framework NVIDIA introduced at GTC25 last month, aligns exactly with the two-stage structure described above, offering high throughput and low latency. Similar frameworks, such as Red Hat's llm-d and LMSYS.org's SGLang, are taking the same approach.

  • Per Request vs Per Token: Both types of compute adopt the concept of serverless, but they differ underneath. Take latency as an example. In the CPU world, low latency typically refers to the time from when a request arrives to when it completes. In the large-model world, it's measured a bit differently, for instance using TTFT (Time to First Token) to capture the delay before the first token is generated, while the overall processing time for long user input contexts becomes an entirely separate challenge (see the timing sketch after this list). Another example is pay-as-you-go, which shifts from per-request to per-token pricing.

  • Peter DeSantis gave an excellent keynote at re:Invent 2024 that highlighted the diverse challenges of AI workloads. "One of the cool things about AI workloads is that they present a new opportunity for our teams to invent in entirely different ways," Peter said.

  • Scarcity: GPU/accelerator capacity is limited. Computing resources are highly constrained for every organization, a major difference from CPU-based systems, where resource constraints are typically less severe. Each company adopts its own strategic approach to this challenge, and many are making steady progress. "AI does not have to be as expensive as it is today, and it won't be in the future. Chips are the biggest culprit. Most AI to date has been built on one chip provider. It's pricey. Trainium should help..." Andy Jassy said in his latest letter to shareholders.
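
    To make the compute-bound vs memory-bound split above concrete, here is a toy Python/NumPy sketch (not a real model, just two matrices standing in for billions of parameters): prefill pushes the whole prompt through one large matrix multiply, while decode touches the full weights once per generated token.

    import numpy as np

    # Toy stand-in for a model: two weight matrices in place of billions of parameters.
    D_MODEL, VOCAB = 1024, 32000
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((D_MODEL, D_MODEL), dtype=np.float32)
    lm_head = rng.standard_normal((D_MODEL, VOCAB), dtype=np.float32)

    def prefill(prompt: np.ndarray) -> np.ndarray:
        """Compute-bound phase: process every prompt token in one large matmul
        and hand the resulting state (the KV cache in a real model) to decode."""
        return prompt @ weights                      # (prompt_len, D_MODEL)

    def decode(state: np.ndarray, steps: int) -> list[int]:
        """Memory-bound phase: each generated token re-reads all of the weights
        but performs comparatively little arithmetic, one token at a time."""
        hidden = state[-1]                           # only the last position drives generation
        tokens = []
        for _ in range(steps):
            hidden = hidden @ weights                # touches the full weight matrix again
            tokens.append(int((hidden @ lm_head).argmax()))
        return tokens

    prompt = rng.standard_normal((512, D_MODEL), dtype=np.float32)  # 512 prompt "tokens"
    print(decode(prefill(prompt), steps=16))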
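
    And to illustrate the per-token latency metrics mentioned under 'Per Request vs Per Token', here is a small timing sketch around a hypothetical streaming client; stream_tokens is a made-up stand-in for any SDK that streams tokens back.

    import time
    from typing import Iterator

    def stream_tokens(prompt: str) -> Iterator[str]:
        """Made-up stand-in for a streaming LLM client; swap in a real SDK call."""
        for word in ["Cloud", " native", " is", " an", " approach", "..."]:
            time.sleep(0.05)                         # pretend per-token decode latency
            yield word

    def measure(prompt: str) -> None:
        start = time.perf_counter()
        ttft = None
        n_tokens = 0
        for _ in stream_tokens(prompt):
            if ttft is None:
                ttft = time.perf_counter() - start   # Time to First Token (prefill-dominated)
            n_tokens += 1
        total = time.perf_counter() - start
        # Inter-token latency after the first token reflects the memory-bound decode phase.
        print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, "
              f"tokens/s after first: {(n_tokens - 1) / (total - ttft):.1f}")

    measure("What is cloud native?")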

    Kubernetes Community Movement

    I had expected Kubernetes to move faster in this space. GPT-3.5 was introduced in late 2022, yet it wasn't until mid-2024 that the community launched two relevant working groups, WG Serving and WG Accelerator Management, to address and enhance serving workloads on Kubernetes, specifically hardware-accelerated AI/ML inference.
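
    For context on what those working groups are improving, the baseline way to request an accelerator on Kubernetes today is the extended resource exposed by a device plugin. Below is a minimal sketch using the official Kubernetes Python client; the pod name and image are placeholders, and it assumes a reachable cluster with the NVIDIA device plugin installed.

    from kubernetes import client, config

    # Assumes kubeconfig points at a cluster where the NVIDIA device plugin
    # exposes GPUs as the extended resource "nvidia.com/gpu".
    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="inference-demo"),        # placeholder name
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="server",
                    image="my-registry/llm-server:latest",          # placeholder image
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"},              # ask the scheduler for one GPU
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

    Part of what these working groups, together with efforts like Dynamic Resource Allocation, aim to improve is how much richer the request for accelerators, and the serving stack around them, can be than this single opaque counter.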


    Google Cloud Run on GPU

    I have to admit, Google Cloud Run made the right move. As a previous builder of AWS App Runner, a product positioned similarly to Cloud Run, I'm excited to see Cloud Run now running on GPUs, as announced at Google Cloud Next 2025. Serverless GPU support is a big deal: it enables Cloud Run to handle large models and opens the door to emerging opportunities in agentic AI, another big deal.


    Last Mile Delivery

    Currently, there is less discussion of lower-level engineering technologies than of scientific research, even though a fair amount of innovation is ongoing. Many underestimate the importance of this area, but it's actually the 'last mile' in delivering AI capabilities to end users. Research, engineering, and product are inseparable. The most pressing bottleneck right now lies in engineering, specifically in serving models with high availability and high performance, cost-efficiently, under today's scarce compute resources.


    Key Primitives (or Building Blocks)

    The fundamentals of serverless remain unchanged. CNCF turns 10 now; Kubernetes, Lambda, ECS, and Alexa turn 11; Bedrock and Claude turn 2. Some say Bedrock is the "Lambda of LLMs." I say it is more than that. As I put it in my post for serverless's ten-year anniversary, serverless continues to play a key role in the LLM world, handling the heavy lifting and delivering real AI/ML value to customers. This principle has held true since before the 'Attention Is All You Need' era.
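
    To make that heavy lifting concrete, here is a minimal sketch of calling a managed model through Amazon Bedrock's runtime Converse API with boto3; the model ID is a placeholder, and AWS credentials, region, and model access are assumed to be configured.

    import boto3

    # Assumes AWS credentials/region are configured and the account has access
    # to the (placeholder) model ID below.
    bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # placeholder model ID
        messages=[{"role": "user", "content": [{"text": "What is cloud native?"}]}],
        inferenceConfig={"maxTokens": 256, "temperature": 0.2},
    )

    # The service handles capacity, scaling, and model hosting; the caller pays
    # per token, much like Lambda's pay-per-request model.
    print(response["output"]["message"]["content"][0]["text"])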


    What Comes Next

    For more than three years, I've served as a Program Committee member for the Open Source Summit and KubeCon + CloudNativeCon, and this year is no exception. After reviewing all the CFPs (Calls for Proposals), it's clear that the theme for 2025 is 'Agentic AI', just as 2024 was all about 'LLMs'. But what exactly is agentic AI, and how can it be used to enhance productivity? There are many answers to that. Anthropic open-sourced MCP (Model Context Protocol), an open standard for connecting LLM applications with external data sources and tools, which has already gained popularity in the industry. Amazon introduced several innovations, including Nova Act, a new AI model trained to perform actions within a web browser, created by the Amazon AGI SF Lab (formerly Adept AI); SWE-PolyBench, a multi-language benchmark for repository-level evaluation of coding agents; and Strands Agents, an open-source AI agents SDK.
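
    To give a concrete flavor of MCP, here is a minimal sketch of a tool server built with the FastMCP helper from the open-source MCP Python SDK; the tool itself is a made-up example.

    from mcp.server.fastmcp import FastMCP

    # A tiny MCP server exposing one tool that an MCP-capable client
    # (an IDE assistant, a desktop app, an agent) can discover and call.
    mcp = FastMCP("demo-tools")

    @mcp.tool()
    def word_count(text: str) -> int:
        """Count the number of words in a piece of text."""
        return len(text.split())

    if __name__ == "__main__":
        # stdio is the simplest transport: the client spawns this process and
        # exchanges JSON-RPC messages over stdin/stdout.
        mcp.run(transport="stdio")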

    From a builder's perspective, Firecracker seems to be back on stage. Inspired by Jeff Barr's post, it's clear that Firecracker's lightweight microVMs are becoming an enabling option for AI coding assistants and, beyond that, agentic AI, letting users speed up development and deployment while running code in protected sandboxes. Companies like E2B are also embracing this approach, providing safe environments for running AI-generated code. As a previous builder of Fargate on Firecracker, I'm excited to see this happening.
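
    For a sense of how small that sandbox control surface is, here is a rough sketch that drives Firecracker's REST API over its Unix socket using only the Python standard library; the socket, kernel, and rootfs paths are placeholders, and a running firecracker process listening on the socket is assumed.

    import json
    import socket

    API_SOCK = "/tmp/firecracker.socket"   # placeholder; set when starting firecracker

    def api_put(path: str, body: dict) -> str:
        """Send a minimal HTTP PUT to the Firecracker API over its Unix socket."""
        payload = json.dumps(body)
        request = (
            f"PUT {path} HTTP/1.1\r\n"
            "Host: localhost\r\n"
            "Content-Type: application/json\r\n"
            f"Content-Length: {len(payload)}\r\n"
            "Connection: close\r\n\r\n"
            f"{payload}"
        )
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect(API_SOCK)
            s.sendall(request.encode())
            return s.recv(4096).decode()

    # Point the microVM at a kernel and root filesystem (placeholder paths), then boot it.
    api_put("/boot-source", {"kernel_image_path": "vmlinux",
                             "boot_args": "console=ttyS0 reboot=k panic=1"})
    api_put("/drives/rootfs", {"drive_id": "rootfs", "path_on_host": "rootfs.ext4",
                               "is_root_device": True, "is_read_only": False})
    api_put("/actions", {"action_type": "InstanceStart"})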

    "The Urgency of Interpretability" (by Dario Amodei, CEO of Anthropic) is undeniable. Jason Clinton, CISO of Anthropic, warns that "fully AI employees are just a year away." Reading the research in "Tracing the thoughts of a large language model" is insightful. At the same time, these experiments also highlight the many opaque aspects of large models from a human perspective, underscoring the need for continued collaborative exploration and greater resource investment. Unless we clearly understand what happens behind the scenes after each prompt is processed and where the resources are being used, we can't effectively identify the right direction for optimization or the right tools to clean up resource usage. With that understanding, we can truly achieve high availability, high performance, high utilization, and cost efficiency for our LLM product.

    The future remains open, and I'm an optimist. A new world, with closer collaboration between Product, Engineering, and Research than we've ever seen, is on the horizon. vLLM (a fast and easy-to-use library for LLM inference and serving), originally built at UC Berkeley and later donated to the LF AI & Data Foundation, is a typical example of a project that has grown into a community-driven effort with contributions from both academia and industry. And there will be more to come.
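
    As a flavor of how approachable that stack has become, here is a minimal vLLM sketch; the model name is a small placeholder, and a supported GPU (or CPU build) with enough memory is assumed.

    from vllm import LLM, SamplingParams

    # vLLM handles batching, paged attention, and KV-cache management
    # behind this simple offline-inference interface.
    llm = LLM(model="facebook/opt-125m")           # small placeholder model

    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    outputs = llm.generate(["What does cloud native mean?"], params)

    for out in outputs:
        print(out.outputs[0].text)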

