I was a guest of the Cloud Native Computing Foundation (CNCF) at its EU KubeCon conference in London the first week of April. Most of my conversations with the vendors at the event can be grouped under three main themes: multi-cluster management, AI workloads, and reducing Kubernetes costs on the cloud.

Multi-cluster management

Running Kubernetes clusters becomes a challenge once the number of clusters grows unwieldy. Large enterprises can have applications running across hundreds of clusters, creating a need for multi-cluster management (also called fleet management), which is fast becoming a focus in the cloud native vendor community. These solutions provide a unified dashboard to manage clusters across multi-cloud environments, public and private, giving visibility into what can otherwise turn into cluster sprawl, and applying FinOps to optimize costs. They help DevOps teams manage scalability and workloads, and they play a part in high availability and disaster recovery, for example by replicating workloads across clusters in different regions. DevOps CI/CD and platform engineering become essential for managing large numbers of clusters.

I spoke with several vendors at KubeCon who are addressing this challenge; for example, SUSE is launching a Rancher multi-cluster management feature for Amazon EKS.

Mirantis is also tuning into this trend: it sees cluster growth at the edge across distributed systems, regulatory pressure for sovereign clouds and data separation, and hybrid cloud adoption all driving the need for better multi-cluster management. To address this, Mirantis launched k0rdent in February 2025, an open source, Kubernetes-native distributed container management solution that runs on public clouds, on-premises, and at the edge, offering unified management of multiple Kubernetes clusters. Key k0rdent features include declarative configuration that makes it easy to scale out, observability with FinOps to control costs, and a services manager that enables services to be built on top of the solution. Mirantis recognizes that Kubernetes has matured into the de facto cloud native standard across multi-cloud environments, which allows its cloud-agnostic solutions to provide portability across them.
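
To make the declarative idea concrete, here is a minimal sketch of what driving a fleet from one management cluster can look like. The CRD group, kind, and field names below are illustrative assumptions for this article, not k0rdent’s actual schema; the point is that a cluster is just an object you declare and a controller reconciles.

```python
# Hypothetical sketch of declarative, fleet-style cluster management.
# The apiVersion/kind and spec fields are assumed names, not k0rdent's API.
import yaml  # pip install pyyaml

cluster_deployment = {
    "apiVersion": "k0rdent.example/v1alpha1",  # assumed group/version
    "kind": "ClusterDeployment",               # assumed kind
    "metadata": {"name": "edge-site-42"},
    "spec": {
        "template": "aws-standalone",          # reusable cluster template
        "config": {
            "region": "eu-west-2",
            "workersNumber": 3,                # scale out by bumping this
        },
    },
}

# Render the object; in practice it would be applied to the management
# cluster (e.g. with kubectl apply) and reconciled from there.
print(yaml.safe_dump(cluster_deployment, sort_keys=False))
```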

Mirantis’s commitment to open source was reinforced with its k0s (edge) Kubernetes distribution and its k0smotron multi-cluster management tool joining the CNCF Sandbox projects. k0rdent is built on top of these foundation projects and goes beyond the basic cluster management in k0smotron.

Amazon EKS Hybrid Nodes, launched at AWS re:Invent 2024, allows existing on-premises and edge infrastructure to be used as nodes in Amazon EKS clusters, unifying Kubernetes management across different environments. It complements Amazon EKS Anywhere, which is designed for disconnected environments, whereas with EKS Hybrid Nodes it is possible to have connectivity and a fully managed Kubernetes control plane across all environments. One use case is enabling customers to augment their AWS GPU capacity with preexisting on-premises GPU investments.

So, looking at AWS’s edge options: with EKS Anywhere the infrastructure is fully disconnected from the cloud and the customer manages the Kubernetes control plane; with EKS Hybrid Nodes the infrastructure is on-premises and AWS manages the Kubernetes control plane; finally, with AWS Outposts both the control plane and the infrastructure are managed by AWS.

I spoke with Kevin Wang, lead of the cloud native open source team at Huawei and co-founder of multiple CNCF projects: KubeEdge, Karmada, and Volcano. Kevin pointed out that Huawei has been contributing to Kubernetes from its earliest days and that its vision has always been to work with open standards. Karmada (an incubating CNCF project) is an open, multi-cloud, multi-cluster Kubernetes orchestration system for running cloud native applications across multiple Kubernetes clusters and clouds. Key features include centralized multi-cloud management, high availability, failure recovery, and traffic scheduling. Example users include Trip.com, which used Karmada to build a control plane for a hybrid multi-cloud, reducing migration costs across heterogeneous environments, and the Australian Institute for Machine Learning, which uses Karmada to manage edge clusters alongside GPU-enabled clusters, ensuring compatibility with diverse compute resources.
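
To illustrate the orchestration model, here is a minimal PropagationPolicy rendered as YAML: it tells the Karmada control plane to run a given Deployment on two member clusters. The field layout follows Karmada’s documented v1alpha1 API, but treat this as a sketch; the resource and cluster names are made up.

```python
# Minimal Karmada-style PropagationPolicy: propagate a Deployment named
# "nginx" onto two member clusters. Names are illustrative.
import yaml  # pip install pyyaml

policy = {
    "apiVersion": "policy.karmada.io/v1alpha1",
    "kind": "PropagationPolicy",
    "metadata": {"name": "nginx-propagation"},
    "spec": {
        "resourceSelectors": [
            {"apiVersion": "apps/v1", "kind": "Deployment", "name": "nginx"}
        ],
        "placement": {
            "clusterAffinity": {"clusterNames": ["cluster-eu", "cluster-apac"]}
        },
    },
}
print(yaml.safe_dump(policy, sort_keys=False))
```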

VMware’s solution for multi-cluster Kubernetes environments has been rebranded VMware vSphere Kubernetes Service (VKS), formerly the VMware Tanzu Kubernetes Grid (TKG) Service, and is a core component of VMware Cloud Foundation. VMware offers two approaches to running cloud native workloads: via Kubernetes and via Cloud Foundry. Perhaps confusingly, Cloud Foundry has the Korifi project, which provides a Cloud Foundry experience on top of Kubernetes and also underpins VMware Tanzu Platform for Cloud Foundry. The point of offering two strands is that the Kubernetes experience targets DevOps/platform engineers familiar with that ecosystem, whereas the Cloud Foundry experience is more opinionated but has a user-friendly interface.

I met with Spectro Cloud, a startup launched in 2020 by serial tech entrepreneurs and now 250 strong. Spectro Cloud offers an enterprise-grade Kubernetes management platform called Palette that simplifies, at scale, the full lifecycle of Kubernetes clusters across diverse environments: public clouds, private clouds, bare metal, and edge locations. Key features are declarative multi-cluster Kubernetes management and a unified platform for containers, VMs, and edge AI. Palette EdgeAI offers a lightweight Kubernetes optimized for AI workloads. Users can manage thousands of clusters with Palette: to do so it operates not in the Kubernetes control plane but in a management plane that sits above it, and it is decentralized, so there are no costly management servers or regional instances; each cluster’s policy is enforced locally. On the edge, Spectro Cloud leverages the CNCF project Kairos, which transforms existing Linux distributions into immutable, secure, and declaratively managed OS images optimized for cloud native infrastructure.

Palette lets users choose from over 50 best-of-breed components when deploying stacks, from Kubernetes distributions to CI/CD tools and service meshes, and these packs are validated and supported for compatibility. Containers and VMs are supported out of the box with little user configuration. Palette uses a customized version of open source Kubernetes, Palette eXtended Kubernetes, as its default, but Spectro Cloud supports the common Kubernetes distros (RKE2, k3s, MicroK8s, cloud-managed services), and customers don’t need to configure these on their own. Furthermore, Spectro Cloud points out that it is distro-agnostic, adopting distros based on customer demand. With half of its business coming from the edge, Spectro Cloud is making edge computing more practicable for AI workloads.

AI workloads and the key role of the Model Context Protocol

AI workloads will grow to become a major part of enterprise compute traffic, and the cloud native community is working to make this transition as seamless as possible. One challenge is navigating the complexities of connecting multiple AI agents with other tools and systems: there is a need for tool discovery and integration, a unified registry, connectivity and multiplexing, and security and governance.

Anthropic created and open sourced the Model Context Protocol (MCP), a standard for AI agents to discover and interact with external tools: it defines how tools describe their capabilities and how agents can invoke them.
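
On the wire, MCP is JSON-RPC 2.0: an agent first asks a tool server what it can do, then invokes a tool using the schema the server advertised. The tools/list and tools/call method names come from the MCP specification; the tool name and arguments below are made up for illustration.

```python
# Sketch of MCP's JSON-RPC 2.0 message shapes (tool name is hypothetical).
import json

# 1. Discover the tools a server exposes.
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# 2. Invoke one of the advertised tools with structured arguments.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_weather",            # hypothetical tool
        "arguments": {"city": "London"},  # matches the tool's input schema
    },
}
print(json.dumps(call_tool, indent=2))
```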

Solo.io, a cloud native vendor, presented at KubeCon its MCP Gateway, built on its API gateway kgateway (formerly Gloo). With tools adopting the MCP standard, MCP Gateway provides a centralized point for integrating and governing AI agents across toolchains. It virtualizes multiple MCP tools and servers into a unified, secure access layer, giving AI developers a single endpoint to interact with a wide range of tools, which considerably simplifies agentic AI application development. Additional key features include: automated discovery and registration of MCP tool servers; a central registry of MCP tools across diverse environments; MCP multiplexing, allowing access to multiple MCP tools via a single endpoint; enhanced security, with the gateway providing authentication and authorization controls to ensure secure interaction between AI agents and tools; and improved observability of AI agent and tool performance through centralized metrics, logging, and tracing.

Furthermore, Solo.io sees MCP Gateway as laying the foundation for an agent mesh, an infrastructure layer for networking across AI agents, such as agent-to-LLM, agent-to-tool, and agent-to-agent communication. 

Continuing on the theme of AI security: working with enterprise AI applications carries two critical risks. First, compliance with regulations in the local jurisdiction, for example the GDPR and the EU AI Act in the EU. Second, how to treat company-confidential data: putting sensitive data into a SaaS-based AI application, for example, moves that data out onto the cloud and leaves the potential for it to leak.

One approach to reducing these risks is taken by SUSE: SUSE AI is a secure, private, enterprise-grade AI platform for deploying generative AI (genAI) applications. Delivered as a modular solution, it lets users adopt only the features they need and also extend it. This scalable platform also provides the insights customers need to run and optimize their genAI apps.

Huawei is involved in CNCF projects for managing AI workloads, such as Kubeflow. Kubeflow started out as a machine learning lifecycle management system, orchestrating the pipeline for ML workloads from development through to production. It has since evolved to manage LLM workloads, leveraging Kubernetes for distributed training across large GPU clusters, providing fault tolerance, and managing inter-process communication. Other features include model serving at scale with KServe (initially developed as the KFServing project within Kubeflow, KServe now sits in the LF AI & Data Foundation, though there is talk of moving it into CNCF), which offers autoscaling for AI traffic loads and optimizations such as model weight quantization to reduce memory footprint and improve speed. Huawei is also a co-founder of the Volcano project, which batch-schedules AI workloads across multiple pods while taking their inter-dependencies into account, so that workloads are scheduled in the correct order.
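
To show what serving at scale looks like in practice, here is a minimal KServe InferenceService rendered as YAML. The API group and predictor layout follow KServe’s v1beta1 documentation; the model URI is a placeholder, and scale-to-zero assumes KServe’s serverless deployment mode.

```python
# Minimal KServe InferenceService: a predictor that autoscales with traffic.
import yaml  # pip install pyyaml

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},
    "spec": {
        "predictor": {
            "minReplicas": 0,  # scale to zero when idle (serverless mode)
            "maxReplicas": 5,  # autoscale up under load
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "gs://example-bucket/models/iris",  # placeholder
            },
        }
    },
}
print(yaml.safe_dump(inference_service, sort_keys=False))
```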

Looking at longer-term research, Huawei is working on how AI workloads interact in production, with applications running at the edge and in robots: how machines communicate with humans and with other machines, and how this scales, for example across a fleet of warehouse robots doing route planning and collision avoidance. This work falls within the scope of KubeEdge (an incubating CNCF project), an open source edge computing framework that extends Kubernetes to edge devices, addressing the challenges of resource constraints, intermittent connectivity, and distributed infrastructure. Part of this research falls under Sedna, an “edge-cloud synergy AI project” running within KubeEdge. Sedna enables collaborative training and inference, integrating seamlessly with existing AI tools such as TensorFlow, PyTorch, PaddlePaddle, and MindSpore.

Red Hat is exploiting AI in its tools. For example, it released version 0.1 of Konveyor AI, which uses LLMs and static code analysis to support upgrading existing and legacy applications; it is part of Konveyor (a CNCF sandbox project), an accelerator for the modernization and migration of legacy applications to Kubernetes and cloud native environments. The Red Hat OpenShift console now includes a virtual AI assistant, OpenShift Lightspeed, that lets users interact with OpenShift using natural language; it is trained on the user’s data, so it has accurate context. To support AI workloads, there is OpenShift AI for developing, deploying, and managing AI workloads across hybrid cloud environments.

VMware supports AI workloads at the infrastructure layer with VMware Private AI Foundation, built on VMware Cloud Foundation, the VMware private cloud. It ensures the databases needed for RAG and storage are available, and it rolls up all the components needed to run AI workloads on Kubernetes, automating their deployment to make things easy for users. The offering is a partnership with Nvidia: it includes the NeMo framework for building, fine-tuning, and deploying generative AI models, and it supports NVIDIA GPUs and NVIDIA NIM for optimized inference on a range of LLMs.

Managing Kubernetes costs on the cloud

Zesty, a startup launched in 2019, has found ways of reducing the cost of running Kubernetes on the cloud by making use of Kubernetes’s connections to the cloud provider. Once installed in a cluster, Zesty Kompass can perform pod right-sizing: it tracks CPU, memory, server, and storage volume usage and dynamically adjusts capacity up or down to match the needs of the workloads. Zesty finds that users provision more capacity than their workloads actually need, and that adjusting capacity dynamically is not easy. Most companies keep a buffer of servers in readiness for demand spikes, so Zesty puts these excess servers into hibernation, which considerably reduces the cost of keeping them. Zesty Kompass can also help users exploit spot instances on their chosen cloud. The solution runs inside a cluster to maintain the best security level, and typically customers deploy multiple clusters to maintain segregation; by installing Kompass in each of them, its dashboard provides a global view of Kompass activity across every cluster it is deployed in. Most recently, Zesty announced that Kompass now includes full pod scaling capabilities, with the addition of a Vertical Pod Autoscaler (VPA) alongside the existing Horizontal Pod Autoscaler (HPA).
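
For context on the right-sizing pattern, here is what the standard upstream Kubernetes VerticalPodAutoscaler object (autoscaling.k8s.io/v1) looks like. This is the open source mechanism behind vertical pod scaling in general, not Zesty’s own API; the target Deployment name is a placeholder.

```python
# Upstream Kubernetes VPA: continuously adjust a Deployment's CPU/memory
# requests to observed usage. Not Zesty-specific; names are illustrative.
import yaml  # pip install pyyaml

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "web-vpa"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "web"},
        # "Auto" lets the VPA apply new resource requests as usage changes.
        "updatePolicy": {"updateMode": "Auto"},
    },
}
print(yaml.safe_dump(vpa, sort_keys=False))
```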

Amazon EKS Auto Mode (launched at AWS re:Invent 2024) is built on the open source project Karpenter. Karpenter manages the node lifecycle within Kubernetes, reducing costs by automatically provisioning and deprovisioning nodes based on the scheduling needs of pods. When deploying workloads, the user specifies scheduling constraints in the pod specifications, and Karpenter uses these to manage provisioning. With EKS Auto Mode, management of Kubernetes clusters is simplified by letting AWS manage cluster infrastructure such as compute autoscaling, pod and service networking, application load balancing, cluster DNS, block storage, and GPU support. Auto Mode also leverages EC2 managed instances, which enables EKS to take on ownership and security of the cluster compute, under the shared responsibility model, where applications need to run.
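
A sketch of that pod-driven model: the pod declares its resource requests and constraints, and Karpenter picks or launches a node to match. The karpenter.sh/capacity-type selector is one of Karpenter’s well-known labels; the application details are illustrative.

```python
# Pod spec whose requests and node selectors drive Karpenter provisioning.
import yaml  # pip install pyyaml

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "batch-worker"},
    "spec": {
        "nodeSelector": {
            "karpenter.sh/capacity-type": "spot",  # prefer cheaper spot capacity
            "kubernetes.io/arch": "arm64",         # e.g. Graviton instances
        },
        "containers": [{
            "name": "worker",
            "image": "example/batch-worker:latest",  # placeholder image
            "resources": {"requests": {"cpu": "2", "memory": "4Gi"}},
        }],
    },
}
print(yaml.safe_dump(pod, sort_keys=False))
```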

Talking with the AWS team at KubeCon, it emerged that AWS is the host cloud for the Kubernetes project at CNCF, which it offers at no cost to CNCF: a nice contribution to open source from Amazon.

Launched in 2019 and now 60 strong, LoftLabs is the vendor that brought virtual clusters to Kubernetes. With virtual clusters, organizations can run fewer physical clusters, and within a cluster virtual clusters give better management of team resources than namespaces do. A recent press release on its customer Aussie Broadband says that development teams could deploy clusters on demand in under 45 seconds; the customer estimates it saved 2,400 hours of dev time and £180k in provisioning costs per year. At KubeCon, LoftLabs launched a new product, vNode, which provides more granular isolation of workloads running inside vClusters. This approach enhances multi-tenancy through improved resource allocation and isolation within the virtualized environments. Since a virtual node is mapped to a non-privileged user, privileged workloads are isolated yet can still access resources, such as storage, that are available on the virtual cluster.

Cloud Native Buildpacks offer improved security

I spoke with the Cloud Foundry team, which mentioned that its CI/CD tool, Concourse, has joined the CNCF projects, and that Cloud Foundry is a prominent adopter of Cloud Native Buildpacks, which it described as the hidden gem within CNCF. Buildpacks transform application source code into container images, including all the necessary dependencies; kpack is an example that runs on Kubernetes. One advantage is that they do away with the need for Dockerfiles. While Docker was transformational in the evolution of cloud native computing, it is not fully open source, which creates an anomaly within CNCF, and supply chain security is not dealt with in Dockerfiles; there is growing demand for greater transparency and openness to reduce security risks. Buildpacks have been evolving to address these security concerns, for example with a software bill of materials. Buildpacks were first conceived by Heroku in 2011 and adopted by Cloud Foundry and others; the open source Cloud Native Buildpacks project joined CNCF in 2018, with graduated status expected in 2026.
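
As a sketch of the workflow, a single pack build turns a source directory into an OCI image with no Dockerfile. The pack CLI and the Paketo builder are real projects; the image tag and source path below are placeholders, driven here from Python for consistency with the other examples.

```python
# Build a container image from source with Cloud Native Buildpacks:
# `pack build` detects the language and assembles the image layers.
import subprocess

subprocess.run(
    [
        "pack", "build", "registry.example.com/myapp:1.0",  # placeholder tag
        "--builder", "paketobuildpacks/builder-jammy-base",
        "--path", "./myapp",  # app source; buildpacks detect the language
    ],
    check=True,
)
# The resulting image carries buildpack-generated metadata, including a
# software bill of materials for supply chain auditing.
```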

Observability company Dash0 was founded in 2023 by CEO Mirko Novakovic, whose previous observability company, Instana, was sold to IBM in 2020. Dash0 performs tracing, logging, metrics, and alerting, and it is built from the ground up around the OpenTelemetry standard; this means there is no vendor lock-in of the telemetry data, which remains in an open, standardized format. It makes use of OpenTelemetry’s semantic conventions to add context to data, and it supports the OpenTelemetry Collector, a central point for receiving, processing, and forwarding telemetry data. Designed to make the developer experience with observability easy, it offers cost transparency and a telemetry spam filter that removes logs, traces, and metrics that are not needed. Mirko’s approach is that since you are looking for a needle in a haystack, you should first make the haystack as small as possible, and this is where AI is used.
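
For a feel of what OpenTelemetry-native means, here is a minimal Python setup of the kind any OTel backend would ingest: spans carry semantic-convention attributes like service.name and are exported over OTLP, here to a local Collector endpoint. The service name and span attribute are illustrative.

```python
# Minimal OpenTelemetry tracing setup exporting OTLP to a local Collector.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "checkout"})  # semantic convention
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.amount", 42.0)  # illustrative attribute
```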

The search space is reduced by not examining logs that have already been processed and show normal behavior. Dash0 then uses LLM-based AI to enhance the data by structuring it, after which it recognizes error codes and drills down further to triage the error source and identify its possible origins. Mirko does not call this root cause analysis, because the term has been overused and has lost credibility through false positives. Instead, Dash0’s triage feature gives the most likely cause of the error as its first choice but also provides probable alternatives, so the developer has material with which to hunt down and isolate the root cause.

Dash0 finds that foundation LLMs can be accurate without additional fine-tuning or retrieval-augmented generation, and it uses more than one LLM to cross-check results and reduce hallucinations.

I spoke with Benjamin Brial, CEO and founder of Cycloid, which provides a sustainable platform engineering solution for Kubernetes that streamlines DevOps, hybrid/multi-cloud adoption, and software delivery. It has established enterprise clients such as Orange Business Services, Siemens, Valiantys, and Hotel Spider, and it contributes to open source with tools like TerraCognita and InfraMap. Digital sovereignty and sustainability are two key missions for the company, which operates in the EU and North America. It reduces costs by presenting to the developer only the tools and features they need, and it emphasizes sustainability through FinOps and GreenOps: it offers a centralized view of cloud costs across providers in a single panel, and it tracks cloud carbon footprint to minimize environmental impact, addressing cloud resource waste. With digital sovereignty becoming more important in the current geopolitical climate, Cycloid, with its base in Paris, leverages its European roots to address regional sovereignty concerns, partnering with local and global players like Orange Business Services and Arrow Electronics to deliver solutions tailored to the European market.

Cycloid uses a plugin framework to integrate any third-party tool. It also embeds open source tools in its solution, such as TerraCognita (for importing infrastructure into IaC), TerraCost (for cost estimation), and InfraMap (for visualizing infrastructure). These tools enable organizations to reverse engineer and manage their infrastructure without dependency on proprietary systems, a key aspect of digital sovereignty. Cycloid gives enterprises the freedom to select the right tools for each process, maintain self-hosted solutions, and embed any kind of automation, such as Terraform, Ansible, and Helm, to deploy IaaS, PaaS, or containers, which is critical for retaining control over data and operations.