Istio Ambient Mesh is the new Istio architecture introduced to the Istio community on September 7th, with initial contributions from Solo.io and Google engineers. Read the full announcement on Istio’s website. Sidecars have been a staple of Istio’s architecture since day one and are responsible for the majority of features available in Istio today. However, sidecars require injecting an additional container into every Kubernetes Pod, introduce computational overhead on application traffic, and require provisioning additional cluster resources to support them. This blog explores how Istio Ambient Mesh was designed to reduce the service mesh infrastructure resources typically associated with sidecars.
What Ambient Means for Resource Usage
Ambient mesh was designed to minimize resource requirements for users in their Kubernetes clusters. To explain how ambient does this, we must first distinguish allocation from utilization. When deploying a Kubernetes cluster in a hosted environment, users must determine how many nodes their cluster will have, and how many vCPU cores and how much memory to allocate for each node. The allocation of these resources factors into the overall cost of the Kubernetes cluster. The pods and services deployed to the cluster may only utilize a percentage of the total allocated resources, but what gets allocated is what ultimately determines the cluster’s cost. This is where ambient takes the stage.
Ambient introduces two new components to Istio: ztunnels and waypoint proxies. Ztunnels enable the most basic configuration of ambient mesh. They provide L4 features such as mTLS, telemetry, authentication, and L4 authorization, and run as a DaemonSet on each node of the Kubernetes cluster. Waypoint proxies provide L7 mesh features such as VirtualService routing, L7 telemetry, and L7 authorization policies, and are deployed at the namespace level per ServiceAccount. Together, ztunnels and waypoint proxies replace sidecars in the Istio service mesh.
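To make that concrete, here is a minimal sketch of opting an application namespace into ambient mode. The `istio.io/dataplane-mode` label comes from the early ambient preview documentation and may evolve as the feature matures; the namespace name is just an example.

```yaml
# Minimal sketch: opt an application namespace into ambient mode.
# Once labeled, traffic to and from pods in this namespace is handled by the
# node-local ztunnel DaemonSet; no sidecar injection is required.
apiVersion: v1
kind: Namespace
metadata:
  name: demo-apps                        # example namespace name
  labels:
    istio.io/dataplane-mode: ambient     # label from the early ambient preview docs
```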
To put this in perspective, let’s take a look at the classic bookinfo application. In this example, there are four different services: productpage, details, ratings, and reviews. Deploying the application results in six pods in the cluster: one each for productpage, details, and ratings, and three for the reviews service, which ships three versions. In the standard sidecar model, Istio adds one istio-proxy sidecar to every pod. Each sidecar container is configured with requested CPU and memory values and corresponding limits. These numbers, along with those of users’ own services, help determine how many vCPU cores and how much memory to select when creating a cluster. With ambient, on a three-node cluster, there will be one ztunnel per node and only one waypoint proxy, since VirtualServices are needed to control routing to the different versions of the reviews service. In this example, the sidecar model requires six additional containers while ambient requires only four. Not a significant amount of savings, but what if each bookinfo service scales to more than one instance?
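For reference, here is a rough sketch of what that single reviews waypoint and its routing rule could look like. This is hedged against the early ambient preview: the Gateway class name (`istio-mesh`) and the `istio.io/service-account` annotation come from the preview documentation and may differ in later builds, the weights are arbitrary, and the DestinationRule defining the v1/v2/v3 subsets is omitted for brevity.

```yaml
# Waypoint proxy scoped to the reviews ServiceAccount (adds L7 processing for reviews traffic).
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: Gateway
metadata:
  name: bookinfo-reviews
  annotations:
    istio.io/service-account: bookinfo-reviews   # preview-era annotation; may change
spec:
  gatewayClassName: istio-mesh                   # preview-era class name; may change
---
# Standard Istio L7 routing; the waypoint enforces this for the reviews service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1        # subsets defined in a DestinationRule (not shown)
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
```

With that single Gateway and VirtualService in place, ambient needs only the three node-local ztunnels plus one waypoint regardless of how many pods back each service, which brings us to scaling.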
When bookinfo scales up to three instances per service, 18 sidecars are required, but ambient still needs only four containers. Here is where we start to see the savings. So let’s put it to the test.
Testing It Out
For these tests, we used a series of scripts utilizing fortio to drive traffic through the Istio service mesh. These scripts have been pushed to GitHub, so feel free to check them out here. The test deploys one fortio client instance and three different versions of the httpbin service, each scaled to 10 replicas. The fortio client sends requests to version 1 of httpbin for a few minutes, then repeats the same for version 2 and finally version 3. For tracking CPU and memory usage throughout the tests, Prometheus, node-exporter, and Grafana are installed. A custom Grafana dashboard was created for observing the relevant data, and it can be found and imported from GitHub here.
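To give a sense of the shape of the test workloads, here is a trimmed-down sketch of one of the three httpbin deployments. The actual manifests live in the linked GitHub repo; the names, labels, image, and per-version ServiceAccount below are illustrative assumptions rather than a copy of those files.

```yaml
# Illustrative shape of one of the three httpbin deployments used in the test.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin-v1
  labels:
    app: httpbin
    version: v1
spec:
  replicas: 10                        # each version is scaled to 10 replicas
  selector:
    matchLabels:
      app: httpbin
      version: v1
  template:
    metadata:
      labels:
        app: httpbin
        version: v1
    spec:
      serviceAccountName: httpbin-v1  # assumed per-version ServiceAccount, so each version gets its own waypoint
      containers:
        - name: httpbin
          image: docker.io/kong/httpbin   # image choice is an assumption
          ports:
            - containerPort: 80
```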
The tests deploy Istio in several scenarios to see how resource consumption varies between runs. The scenarios are listed below, followed by a sketch of the install configuration used:
- Istio with sidecar
- Ambient with L4 ztunnel only
- Ambient with L4 ztunnel and L7 waypoint proxies
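For the two ambient scenarios, Istio was installed with its experimental ambient profile, while the sidecar scenario uses a standard default-profile install. The sketch below is a minimal illustration; the `ambient` profile name comes from the early experimental builds and may change.

```yaml
# Illustrative IstioOperator for the ambient runs (the sidecar run uses the default profile).
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ambient-install
  namespace: istio-system
spec:
  profile: ambient   # profile name from the experimental ambient builds
```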
Throughout the tests, the Grafana dashboard watches a variety of CPU and memory metrics reported by Prometheus.
Grafana observes the CPU and memory usage of the aforementioned Istio resources. One CPU as reported by Kubernetes is equivalent to 1 vCPU/core on cloud providers and 1 hyper-thread on bare-metal Intel processors. For memory, we use the working set of bytes used by containers, which excludes inactive file usage, so inactive cache is not included in the metric.
Total Sidecar/Ambient CPU per Pod tracks the rate of cumulative CPU time consumed, in seconds, for containers matching the naming conventions for sidecars, ztunnels, and waypoint proxies. Total Sidecar/Ambient RAM per Pod does the same for memory, via the working set of bytes reported at a given instant. Max Sidecar/Ambient CPU and RAM use the same queries as above but track the max_over_time during each test run to capture the highest value observed. Last are the Total CPU and RAM of Workload panels, which compare usage of the entire test namespace against usage of just the Istio dataplane workloads (sidecars, ztunnels, waypoint proxies). So let’s compare some numbers.
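Before we do, here is a rough sketch of what those panels compute, expressed as Prometheus recording rules. The metric names are the standard cAdvisor series; the container-name regex and rate window are assumptions rather than a copy of the actual dashboard linked above, and the Max panels simply wrap these expressions in max_over_time.

```yaml
# Approximation of the dashboard queries as Prometheus recording rules.
groups:
  - name: istio-dataplane-usage
    rules:
      # Per-pod CPU: rate of cumulative CPU seconds for dataplane containers
      # (sidecar istio-proxy, ztunnel, waypoint); the regex is an assumption.
      - record: istio_dataplane:cpu_usage:rate2m
        expr: |
          sum by (pod) (
            rate(container_cpu_usage_seconds_total{container=~"istio-proxy|ztunnel|waypoint.*"}[2m])
          )
      # Per-pod memory: working set bytes, which excludes inactive file cache.
      - record: istio_dataplane:memory_working_set_bytes
        expr: |
          sum by (pod) (
            container_memory_working_set_bytes{container=~"istio-proxy|ztunnel|waypoint.*"}
          )
```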
Istio Ambient Mesh Analysis
Captures of CPU and Memory by pod for sidecar and ambient pods for three scenarios
Let’s start by looking at CPU usage by pod. In the sidecar scenarios, the container utilizing the most CPU is the fortio client’s istio-proxy sidecar, which is responsible for sending traffic to all pods during the test; the httpbin servers’ istio-proxy containers consume very little while handling the incoming requests. In the ambient scenario with the L4 ztunnel only, each ztunnel instance sees small spikes as it handles cross-node traffic for the different requests. In the ambient scenario with both ztunnels and waypoint proxies, the waypoint proxies consume values similar to a typical sidecar.
Next is memory usage by pod. In all Istio scenarios, memory usage stays relatively constant for each pod during the test runs. In the ambient scenarios, the L4 ztunnel proxy consumes the most memory, and waypoint proxies once again consume a similar amount of resources as sidecars do. So far we are not seeing anything that suggests ambient is going to save users any infrastructure costs, so let’s take a look at the total usage across the cluster.
Captures of total CPU and Memory comparing total workload to Istio dataplane usage for three scenarios
Captures of CPU and Memory by pod for sidecar and ambient pods in stacked view for three scenarios
Looking at total CPU and memory utilization, we have to remember that the sidecar scenario requires 31 sidecar containers (one client and 30 servers), while ambient requires only three ztunnel containers and three waypoint proxies. That means 28 fewer containers in the ztunnel-only scenario and 25 fewer in the ztunnel-with-waypoint-proxy scenario. The stacked graphs of CPU and memory by pod highlight just how many additional containers are present between scenarios. Memory usage of the Istio dataplane workloads in the ambient scenarios is just 25%-33% of what is used in the sidecar scenarios. Looking at CPU, ambient does show an initial spike from the L4 ztunnels, but average CPU usage in the ambient scenarios is only about 20% of what the sidecar scenario needs.
Going further, we need to account for an item raised earlier in this blog: utilization versus allocation. The metrics covered so far in Grafana have only addressed utilization. Every sidecar has a default request of 100 millicores of vCPU and 128Mi of memory, as well as limits of 2 vCPUs and 1Gi of memory. With 31 sidecars in the sidecar scenario, the requested values alone mean the Istio dataplane would require 3.1 vCPU cores and roughly 3.9Gi of memory from the cluster.
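For reference, those defaults correspond to the following container resources block (configurable through Istio’s global proxy settings); the 31-sidecar arithmetic simply multiplies the request values by the number of sidecars.

```yaml
# Default sidecar resource requests and limits described above.
resources:
  requests:
    cpu: 100m        # 31 sidecars x 100m  = 3.1 vCPU requested
    memory: 128Mi    # 31 sidecars x 128Mi = ~3.9Gi requested
  limits:
    cpu: "2"
    memory: 1Gi
```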
Assuming ztunnels and waypoint proxies have similar requests and limits, ambient with L4 only needs 300 millicores of vCPU and 384Mi of memory for the dataplane. Ambient with waypoint proxies needs an additional 300 millicores and 384Mi, bringing the totals to 600 millicores and 768Mi respectively. That’s over a 75% reduction in resource requirements between ambient and sidecar.
Conclusion
Ka-ching. These results were collected with a very early version of ambient, which is still under active development. Ambient mesh’s goal of reducing infrastructure costs is already bearing fruit and marks a solid step forward on its roadmap to production readiness. These early numbers suggest users could cut their resource requirements by as much as 75%, especially if they only require an L4 mesh. It should be mentioned that ambient does not mean the end of sidecars. In fact, ambient mesh today already supports interoperability between ambient mode and sidecar-based Istio. There will always be dedicated use cases where sidecars remain a good choice, but ambient mesh aims to be the best option for many users going forward.
Learn More About Istio Ambient Mesh
Check out these resources to learn more:
- Announcing Istio Ambient Mesh by Idit Levine – Solo.io
- Introducing Ambient Mesh article from John Howard – Google, Ethan J. Jackson – Google, Yuval Kohavi – Solo.io, Idit Levine – Solo.io, Justin Pettit – Google, Lin Sun – Solo.io
- Get Started with Ambient Mesh guide by Lin Sun – Solo.io, John Howard – Google
- Ambient Mesh Security Deep Dive article by Ethan Jackson – Google, Yuval Kohavi – Solo.io, Justin Pettit – Google, Christian Posta – Solo.io
- On demand workshop: Get Started with Istio Ambient Mesh (with Ambient Mesh Foundation Certification)
- The Cloudcast podcast with Louis Ryan – Google, Christian Posta – Solo.io