CNCF 10 Years: CPU Native vs GPU Native

MAKE CLOUD NATIVE UBIQUITOUS




CNCF (Cloud Native Computing Foundation) turns 10. I published a post when Kubernetes turned 10 in 2024, and the sentiment is the same: it has been a magical journey to be part of these programs for half of their ten-year history. I am honored and humbled to have had the opportunity to serve in various roles, such as Local Community Lead, Ambassador, Program Committee Member, and Speaker.

Motivated by Janet Kuo's presentation, 'Kubernetes at 10: A Decade of Community-Powered Innovation,' from the KuberTENes Birthday Bash at Google's Bay View office in Mountain View, along with the event's T-shirt design, here is a list of my ten KubeCons:

  • KubeCon + CloudNativeCon North America Salt Lake City, Utah 2024 (Program Committee Member)
  • KubeCon + CloudNativeCon India Delhi 2024 (Program Committee Member)
  • KubeCon + CloudNativeCon + Open Source Summit China Hong Kong 2024 (Program Committee Member)
  • KubeCon + CloudNativeCon Europe Paris, France 2024 (Program Committee Member)
  • KubeCon + CloudNativeCon North America Chicago, IL 2023 (Program Committee Member)
  • KubeCon + CloudNativeCon Europe Amsterdam, Netherlands 2023 (Program Committee Member)
  • KubeCon + CloudNativeCon North America Detroit, MI 2022 (Speaker)
  • KubeCon + CloudNativeCon North America LA, CA / Virtual 2021
  • KubeCon + CloudNativeCon North America 2020 Virtual
  • KubeCon + CloudNativeCon North America San Diego, CA 2019 (First KubeCon)

  • Being Named a CNCF Ambassador

    (KubeCon EU 2023 Amsterdam, Netherlands. Keukenhof, known as the Garden of Europe, one of the world's largest flower gardens)

    I was first named a CNCF Ambassador in late 2022, with the announcement made public during KubeCon Europe 2023 in Amsterdam. I remember that at the time, only three Ambassadors were selected from Amazon; one of them was my mentor and colleague, a Principal Engineer in the same VP organization, and I was the only one based in Seattle. After completing my one-year term, I was reappointed as a CNCF Ambassador for another two-year term in 2024. It's a privilege to be recognized and to continue being part of the global Cloud Native community, alongside 154 fellow Ambassadors from 37 countries and 124 companies. This journey has been both unforgettable and deeply meaningful to me.


    CPU vs GPU

    LLMs have ushered in a new era of GPU-based computing, powering both model training and inference and paving the way for agentic AI. It feels like the CPU era faded almost overnight. I have worked on both sides, as a founding engineer of such products at Amazon: on the GPU side, Bedrock; and on the CPU side, quite a few, including App Runner, ECS/Fargate, Lambda, and Elastic Beanstalk.

    A few differences between GPUs / other accelerators and CPUs:

  • Scale out vs. scale up: AI workloads are scale-up workloads. Larger models demand more compute, and scientific and algorithmic limitations mean that simply scaling out doesn't work.
  • Compute-bound vs. memory-bound: Large model inference is really two workloads: prefill and token generation. Prefill needs a lot of compute to convert the input into the data structures handed off to the next phase. Token generation then produces each token sequentially, one at a time, which places a very different set of demands on AI infrastructure: every time a token is generated, the entire model has to be read from memory, but only a small amount of compute is used. Token generation therefore puts heavy demand on memory bandwidth and very little on compute, almost the exact opposite of prefill (see the back-of-the-envelope sketch after this list). For agentic AI scenarios, customers care about fast prefill and really fast token generation.
  • Per request vs. per token: Both kinds of compute adopt the concept of serverless, but the details behind it differ. Take latency as an example. In the CPU world, low latency typically refers to the time from when a request arrives until it completes. In the large-model world, it is measured differently, for instance using TTFT (Time to First Token) to capture the delay before the first token is generated, plus decode throughput in tokens per second afterward (the second sketch after this list shows how these are computed). When dealing with long input contexts from users, the overall processing time becomes an entirely separate challenge. Pay-as-you-go follows the same split: billing per request on the CPU side versus per token on the GPU side.
  • Scarcity: GPU / accelerator capacity is limited. Andy Jassy wrote in his 2024 shareholder letter, "AI does not have to be as expensive as it is today, and it won't be in the future. Chips are the biggest culprit. Most AI to date has been built on one chip provider. It's pricey. Trainium should help..." Yet in reality, computing resources remain extremely limited for every individual organization.

  • Peter DeSantis gave an excellent keynote at re:Invent 2024 that highlighted the diverse challenges of AI workloads. "One of the cool things about AI workloads is that they present a new opportunity for our teams to invent in entirely different ways," Peter said.
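
    To make the compute-bound vs. memory-bound contrast concrete, here is a back-of-the-envelope sketch, not a benchmark. It assumes a dense transformer that costs roughly 2 FLOPs per parameter per token and must read its full set of fp16 weights from memory on every decode step; the model size, prompt length, and the accelerator's peak FLOPs and memory bandwidth are illustrative placeholders, not measurements of any specific chip.

```python
# Back-of-the-envelope roofline estimate for prefill vs. token generation.
# Simplifying assumptions: dense transformer, ~2 FLOPs per parameter per token,
# fp16 weights (2 bytes/param), KV cache and attention costs ignored.
# Hardware numbers are placeholders, not vendor specs.

PARAMS = 70e9            # model parameters (illustrative ~70B dense model)
BYTES_PER_PARAM = 2      # fp16/bf16 weights
PROMPT_TOKENS = 2048     # tokens processed in parallel during prefill

PEAK_FLOPS = 1.0e15      # accelerator peak FLOP/s (illustrative)
PEAK_BW = 3.0e12         # accelerator memory bandwidth, bytes/s (illustrative)

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from memory."""
    return flops / bytes_moved

# Prefill: all prompt tokens flow through the weights in one pass.
prefill_flops = 2 * PARAMS * PROMPT_TOKENS
prefill_bytes = PARAMS * BYTES_PER_PARAM          # weights read roughly once
prefill_ai = arithmetic_intensity(prefill_flops, prefill_bytes)

# Decode: one token at a time, weights re-read on every step.
decode_flops = 2 * PARAMS
decode_bytes = PARAMS * BYTES_PER_PARAM
decode_ai = arithmetic_intensity(decode_flops, decode_bytes)

# Ridge point of the roofline: below it, memory bandwidth is the limit.
ridge = PEAK_FLOPS / PEAK_BW

print(f"prefill arithmetic intensity: {prefill_ai:,.0f} FLOPs/byte")
print(f"decode  arithmetic intensity: {decode_ai:,.0f} FLOPs/byte")
print(f"hardware ridge point:         {ridge:,.0f} FLOPs/byte")
print(f"prefill is {'compute' if prefill_ai > ridge else 'memory'}-bound, "
      f"decode is {'compute' if decode_ai > ridge else 'memory'}-bound")

# Lower bound on per-token decode latency if memory bandwidth is the limit.
print(f"decode floor: {decode_bytes / PEAK_BW * 1e3:.1f} ms/token "
      f"(~{PEAK_BW / decode_bytes:,.0f} tokens/s)")
```

    Even with rough numbers, prefill lands well above the hardware's ridge point (compute-bound) while decode lands far below it (memory-bound), which is exactly the asymmetry described in the bullet above.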
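
    And here is a small sketch of the per-token latency metrics mentioned above, TTFT and steady-state tokens per second, measured against any token iterator. The fake_token_stream function is a hypothetical stand-in for a real streaming inference API, included only so the example runs on its own.

```python
import time
from typing import Iterable


def fake_token_stream(prefill_s: float = 0.5, per_token_s: float = 0.05,
                      n_tokens: int = 20) -> Iterable[str]:
    """Hypothetical stand-in for a streaming LLM endpoint: slow first token,
    then a steady decode rate."""
    time.sleep(prefill_s)             # simulated prefill delay
    for i in range(n_tokens):
        time.sleep(per_token_s)       # simulated per-token decode
        yield f"tok{i} "


def measure_stream(stream: Iterable[str]) -> dict:
    """Compute TTFT and decode throughput for any token iterator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start        # Time To First Token
        count += 1
    total = time.perf_counter() - start
    decode_time = (total - ttft) if count > 1 else 0.0
    return {
        "ttft_s": ttft,
        "total_s": total,
        "tokens": count,
        "decode_tok_per_s": (count - 1) / decode_time if decode_time else float("nan"),
    }


if __name__ == "__main__":
    print(measure_stream(fake_token_stream()))
```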


    Kubernetes Community Movement

    I had expected Kubernetes to move faster in this space. GPT-3.5 was introduced in late 2022, yet it wasn't until mid-2024 that the community launched two relevant working groups, WG Serving and WG Accelerator Management, to address and enhance serving workloads on Kubernetes, specifically hardware-accelerated AI/ML inference.


    Google Cloud Run on GPU

    I have to admit, Google Cloud Run made the right move. As a previous builder of AWS App Runner, a product positioned similarly to Cloud Run, I'm excited to see Cloud Run now running on GPUs, as announced at Google Cloud Next 2025. Serverless GPU support is a big deal: it enables Cloud Run to handle large models and opens the door to emerging opportunities in agentic AI, another big deal.


    Key Primitives (or Building Blocks)

    The fundamentals of serverless remain unchanged. CNCF turns 10 now; Kubernetes, Lambda, ECS, and Alexa turned 10 last year. Bedrock and Claude turn 2. Some say Bedrock is the "Lambda of LLMs." I say it is more than that. As I put it in my post for Serverless's 10-year anniversary, serverless continues to play a key role in the LLM world, handling the heavy lifting and delivering real AI/ML value to customers. This principle has held true since before the 'Attention Is All You Need' era.


    What Comes Next

    Firecracker seems to be back on the stage. Inspired by Jeff Barr's post, it's clear that Firecracker lightweight VMs are becoming an enabling option for AI coding assistants and beyond, into agentic AI, allowing users to speed up development and deployment while running code in protected sandboxes. Companies like E2B are also embracing this approach, providing safe environments for running AI-generated code. As a previous builder of Fargate on Firecracker, I'm excited to see this happening.
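
    As a rough illustration of the sandbox idea, here is a minimal sketch that writes a Firecracker microVM configuration and launches it in config-file mode. The kernel and rootfs paths, the vCPU/memory sizing, the notion of baking AI-generated code into the rootfs, and the exact CLI flags (which assume a recent Firecracker release) are all assumptions for illustration; real sandboxing platforms add networking, a jailer, snapshotting, and lifecycle management on top.

```python
import json
import subprocess

# Minimal Firecracker microVM config (sketch). Paths are assumptions: you need
# a local uncompressed kernel image and an ext4 rootfs that already contains
# the code you want to run in isolation (e.g., AI-generated code).
vm_config = {
    "boot-source": {
        "kernel_image_path": "vmlinux.bin",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
    },
    "drives": [
        {
            "drive_id": "rootfs",
            "path_on_host": "rootfs.ext4",
            "is_root_device": True,
            "is_read_only": False,
        }
    ],
    "machine-config": {"vcpu_count": 1, "mem_size_mib": 512},
}

with open("vm_config.json", "w") as f:
    json.dump(vm_config, f, indent=2)

# Launch the microVM; the untrusted code runs inside the guest, isolated from
# the host by KVM plus Firecracker's minimal device model.
subprocess.run(
    ["firecracker", "--no-api", "--config-file", "vm_config.json"],
    check=True,
)
```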

