For the last few years at Solo.io, we have been rethinking service meshes from the ground up. A key part of this vision was for the service mesh to become "ambient mesh" – present everywhere, but faded into the background. Our engineers led the re-architecting of Istio's core with its ambient mode, which has revitalized the project, delivering the most powerful, simple, and efficient service mesh engine, and a base on which to build great products.
Istio brings support for a single Kubernetes cluster, and Gloo Mesh adds support for additional platforms like ECS. The obvious next step is making ambient mesh span multiple clusters. While multi-cluster support has existed in Istio for many years, we didn't want to simply replicate the same solution, and its limitations, when adding multi-cluster support to Gloo Mesh. Instead, we built on our years of experience operating multi-cluster service meshes at scale to bring a generational improvement to the paradigm, delivering unprecedented levels of scale without the complexity.
Multi-cluster problems
As their Kubernetes adoption expands, enterprises increasingly turn to multiple clusters, whether to separate applications or environments, support multi-cloud strategies, or handle massive scale.
We've talked at length about the scaling problems imposed by the traditional sidecar architecture. Operating on a single cluster provides a natural ceiling on the scale a mesh can reach, due to intrinsic limits on cluster size imposed by Kubernetes and other platforms. With multiple clusters, there is no ceiling on how many workloads can join the mesh, and traditional multi-cluster solutions take the sidecar's scaling problems and amplify them.
Ambient mesh was built to resolve these fundamental issues with the service mesh, and as a result, we have shown how simple it is to operate an ambient mesh at scale.
Ultimately, when designing a multi-cluster solution, some important considerations are:
- Reliability: how does the mesh function when a cluster becomes unavailable? Often, reliability is one of the top motivators for adopting a multi-cluster architecture in the first place, so ensuring the mesh itself is reliable is imperative to meet this use case.
- Scale: how many clusters can join the mesh, and at what cost? How large can each cluster be (in terms of services, pods, etc)?
- Complexity: how much effort is required to set up and maintain the environment?
- Policy management: how can we manage traffic between clusters? For example, can we provide authentication and authorization across clusters? Can we customize how traffic flows between clusters, to tune our traffic patterns for cost (avoiding cross-region charges, for instance) or reliability?
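As a concrete illustration of cross-cluster policy, Istio's standard AuthorizationPolicy API can express identity-based rules like the sketch below. The namespace and service account names are illustrative, not from a real deployment; the key point is that with a shared root of trust, a workload's SPIFFE identity is valid regardless of which cluster it runs in:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend          # illustrative name
  namespace: backend            # illustrative namespace
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # The caller's mesh identity. Because identity is tied to the
        # service account rather than the cluster, this rule applies to
        # frontend pods in any cluster that shares the trust domain.
        principals: ["cluster.local/ns/frontend/sa/frontend"]
```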
While the existing Istio sidecar multi-cluster support addresses these concerns, it doesn't do so particularly well. The architecture relies on each Istio control plane connecting to every cluster's API server, imposing inherent limits on scale and reliability:
- As the mesh grows, this O(n²) cost compounds quickly and can blow up with only a handful of large clusters joined together.
- Exposing each API server can be a substantial operational burden, as well as a security concern.
- When a cluster's API server is unavailable (which is not uncommon during upgrades or other scenarios), routing to that cluster may fail.
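To make the scaling math concrete, here is a back-of-the-envelope sketch (our own simplification, not Istio code): with n clusters, each cluster's control plane watches every cluster's API server, so the total number of watch connections grows quadratically.

```python
def watch_connections(clusters: int, replicas_per_control_plane: int = 1) -> int:
    """Each cluster's control plane watches every cluster's API server,
    so total connections = clusters x clusters x control plane replicas."""
    return clusters * clusters * replicas_per_control_plane

# Doubling the cluster count quadruples the number of connections,
# and each new connection means another full copy of that cluster's
# service and endpoint state being watched.
for n in (2, 10, 50):
    print(n, "clusters ->", watch_connections(n), "watch connections")
```

Running multiple control plane replicas per cluster for high availability, as most production deployments do, multiplies this further.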
Multi-cluster 2.0
We are delighted to add support for ambient multi-cluster in Gloo Mesh. This new enterprise solution brings not only the simplicity of ambient mesh that our users have come to love to multi-cluster environments, but also unprecedented scale. With this new architecture, we are confident that Gloo Mesh users can scale ambient mode to the largest deployments on the planet.
- Reliability: the new architecture drops the dependency on remote API servers and is designed to seamlessly handle temporary downtime between clusters.
- Scale: however large your environment is, Gloo Mesh can handle it - period. The new architecture is multiple orders of magnitude more scalable than any existing multi-cluster service mesh solution and can scale beyond 1 million pods. Don't want to take our word for it? Keep reading for some scale tests below!
- Complexity: onboarding a cluster to the mesh is now a single-step process. Say goodbye to copying `kubeconfigs` around with sensitive credentials!
To put this new deployment to the test, we designed a test environment comparing the traditional Istio sidecar multi-cluster approach with our new ambient multi-cluster mode. Each cluster hosts a relatively small set of resources: 50 services with 50 pods each (2500 pods total), with 1 pod change per second, across 50 nodes.
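The per-cluster numbers above multiply out as follows (a quick sanity check of the test setup):

```python
services_per_cluster = 50
pods_per_service = 50

# 50 services x 50 pods = 2,500 pods per cluster
pods_per_cluster = services_per_cluster * pods_per_service

# Total pods in the mesh at each stage of the scale test
for clusters in (1, 10, 1000):
    print(clusters, "clusters ->", clusters * pods_per_cluster, "pods")
```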
Here is a comparison of the control plane CPU utilization when scaling from a single cluster to 10 connected clusters:
We can see that while in sidecar mode Istiod maxes out at 16 cores (utilizing the full machine) even after the load is complete, in ambient mode control plane CPU hovers around 10% of a core and never exceeds 30%!
Bigger scales are no problem, either. Here is the same test, extended from 10 connected clusters to 1,000 connected clusters. This brings the total connected pods to 2.5 million:
After some initial load during the bootstrapping of the cluster connectivity, control plane CPU drops back down to incredibly low utilization.
Stay tuned for a future post where we will push ambient multi-cluster to its limits — far beyond the scales tested here.
Get started with multi-cluster Ambient Mesh
For the first time, your service mesh can keep up with the scale of your estate - no matter the size.
Learn more in our documentation and contact us to get started.