Using Istio in Ambient Mode - Do more for less!

August 23, 2024
Alex Ly
Craig Box
Daneyon Hansen

With the introduction of Istio’s ambient data plane mode, platform teams can efficiently adopt service mesh features and offer enhanced functionality with minimal resource and overhead impact to end consumers.

What is Istio Ambient Mode?

Istio’s ambient mode was first announced in September 2022. It is a new data plane mode — without sidecars — designed for simplified operations, broader application compatibility, and reduced infrastructure cost. Ambient mode splits Istio’s functionality into two distinct layers: the zero-trust secure overlay layer (the ztunnel), and the optional Layer 7 processing layer (the waypoint). Compared with sidecars, the layered approach allows users to adopt Istio incrementally from no mesh, to mTLS-based zero-trust overlay, to full L7 processing, as needed. This gives service mesh users two outstanding options from the same dedicated community: Istio with its traditional sidecar approach, or sidecarless ambient mode.

In this blog, we will explore example scenarios related to resourcing costs and operational challenges that often create tension when adopting a service mesh. We will also discuss how adopting ambient mode can address these challenges and improve the platform’s Total Cost of Ownership (TCO).

You can learn more about Istio’s ambient mode here.

Istio’s Ambient Mode Reduces Resource Costs

By default, Istio recommends allocating 0.1 CPU cores and 128MB of memory per sidecar. Whilst the advantages of service mesh are well understood, the resource requirements can accumulate quickly for real production workloads.
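To make that accumulation concrete, here is a minimal sketch that multiplies the per-sidecar figures quoted above by a pod count. The fleet sizes are hypothetical and chosen only for illustration, and memory is converted using decimal units (1000MB per GB) as an assumption.

```python
# Per-sidecar defaults quoted above: 0.1 CPU cores and 128MB of memory.
SIDECAR_CPU_MILLICORES = 100  # 0.1 core, kept in millicores so the math stays exact
SIDECAR_MEMORY_MB = 128


def sidecar_reservation(pod_count: int) -> tuple[float, float]:
    """Return (CPU cores, memory in GB) reserved by sidecars across pod_count pods."""
    cores = pod_count * SIDECAR_CPU_MILLICORES / 1000
    memory_gb = pod_count * SIDECAR_MEMORY_MB / 1000
    return cores, memory_gb


# Hypothetical fleet sizes, for illustration only.
for pods in (50, 200, 1000):
    cores, memory_gb = sidecar_reservation(pods)
    print(f"{pods} pods: {cores} cores and {memory_gb} GB reserved for sidecars")
```

These are reservations made before any traffic flows, which is why the cost shows up even in idle environments.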

For teams with a tight budget, complying with this requirement presents significant opportunity costs. They face a trade-off: bearing the increased resource cost for the application competes directly with hiring additional staff or investing in other areas of the business.

In your research about the service mesh, you will have learned about the array of Layer 7 capabilities, including:

  • Advanced traffic management and routing control
  • Fine-grained security policies at the application layer
  • Efficient handling of circuit breaking and fault tolerance mechanisms
  • Facilitated implementation of service discovery and dynamic service routing
  • Simplified deployment of A/B testing and canary release strategies

However, what if you don’t need these capabilities in the near-term, or for every service you operate? These features might be considered a value-add, and may not match your level of organizational need or maturity.

Let’s consider a few challenging scenarios related to the cost of implementing a service mesh:

Scenario 1: Control spiraling sidecar costs

“Our security team has mandated a zero-trust posture across the entire organization, and as a result we are looking to adopt a service mesh. However, we have discovered that the sidecar approach will incur additional cost (in resource reservations per application).”

With Istio’s ambient mode, you no longer need to adopt a sidecar per application to comply with the security team’s zero-trust mandate. Instead, you can leverage the shared, per-node ztunnel component, which handles the responsibilities of zero-trust networking. You can then opt into safe, per-namespace Layer 7 policy handling when you are ready, or just for the services that need it. This opt-in approach means that the incremental cost of moving from no mTLS to mTLS is now much lower.

Scenario 2: Reduce replicated resourcing costs

“Our organization requires that development, testing, and staging environments replicate the production setup to catch issues early and maintain high-quality releases. Adopting a service mesh means that each environment incurs additional resource and operational overhead, leading to ballooning overall costs.”

We often see customer environments that replicate their production setup, where the lower-level environments receive only a small amount of synthetic traffic, or in some cases no traffic at all. This makes it hard for environment owners to justify the value of running a service mesh in environments with low or no traffic. The resources allocated to sidecar proxies are largely wasted, contributing to inefficient utilization and higher expenses. By removing the sidecar requirement, ambient mode reduces the base resource requirements, making it more cost-efficient to implement zero-trust while replicating production environments as the standard across the entire development lifecycle.

Scenario 3: Navigating varying velocity patterns across architectures

“We have a complex microservice architecture where different services experience widely varying traffic patterns. Some services handle thousands of requests per second (RPS), while others only handle a few hundred. Despite this, the sidecar proxies are generally allocated the same resources across all services, leading to inefficient resource utilization. High-traffic services suffer from performance bottlenecks, while low-traffic services waste resources.”

Another common scenario we see is that sidecar proxies are provisioned with a one-size-fits-all approach, leading to inefficient resource utilization. Two things are often missing: a mechanism to override the default proxy resource requests (e.g., Istio pod annotations exposed in an application’s Helm chart), and education for development teams on how to use observability data to configure and tune those requests. Without them, teams onboarding applications have varying success depending on their application characteristics. Removing the sidecars entirely eliminates the inefficiencies caused by underutilization. Moreover, because ztunnel was designed as a highly performant component built in Rust, it can still handle the high-throughput use case.
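As an illustration of the kind of per-service override that is often missing in sidecar deployments, the sketch below builds the annotation map a Helm chart might render onto a pod template. It uses the `sidecar.istio.io/proxyCPU` and `sidecar.istio.io/proxyMemory` annotations from Istio’s sidecar mode; the helper name and the resource values are hypothetical, not tuning recommendations.

```python
def proxy_resource_annotations(cpu: str, memory: str) -> dict[str, str]:
    """Build pod annotations that override the sidecar's default resource requests."""
    return {
        "sidecar.istio.io/proxyCPU": cpu,
        "sidecar.istio.io/proxyMemory": memory,
    }


# Hypothetical values: a busy service gets a larger proxy, a quiet one a smaller proxy.
high_traffic = proxy_resource_annotations(cpu="500m", memory="256Mi")
low_traffic = proxy_resource_annotations(cpu="50m", memory="64Mi")

print(high_traffic)
print(low_traffic)
```

Choosing good values still depends on someone reading the proxy’s observability data for each service, which is the tuning burden that goes away once the sidecar is removed.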

Scenario 4: Implementing service mesh to achieve zero-trust in edge environments

“Deploying a service mesh in resource-constrained edge environments is challenging due to the additional overhead of sidecars.”

Edge computing processes data closer to the source, often in less secure or untrusted networks, and with limited resources. Implementing zero-trust ensures that every interaction is authenticated and authorized, significantly reducing the risk of breaches, but it can be difficult in resource-constrained environments. Because Istio’s ambient mode does not require the resource overhead of sidecars, edge computing use cases in IoT, retail, and healthcare can realistically consider implementing zero-trust principles and achieving compliance to strengthen the security of their edge environments.

These are just some common sidecar challenges that Istio’s ambient mode can help resolve. Other resourcing roadblocks that Istio’s ambient mode can help navigate include reducing application downtime experienced with sidecar deployments and solving complexities with egress configurations.

Istio’s Ambient Mode Helps Simplify Application Operations

Safely restarting a service to upgrade the sidecar version involves three different settings: controlling Envoy draining and rejecting new connections, grace time for active connections to close, and terminating the pod when all active connections are closed.

One of the most common challenges that we see with service mesh adoption is the increase in operational overhead for an application owner to manage the lifecycle of the sidecar. For example, when upgrading from Istio 1.19 to 1.20, every application pod in the cluster must be restarted to apply the new proxy version. You have to consider draining traffic from workloads, terminating connections, and what happens if a pod doesn’t restart due to a dependency being temporarily unavailable.

All of these considerations go away when adopting a sidecarless service mesh architecture! Ambient mode significantly reduces operational burden, providing developers more time to focus on developing application features rather than the infrastructure-related concerns of managing proxies at scale.

To help contextualize this, let’s consider a few challenging scenarios related to operating a service mesh:

Scenario 1: Solving the complexity of upgrading Istio deployments

“We have a complex Istio deployment with many customizations, and upgrading to newer versions has been a major pain point. The upgrade process often results in downtime, configuration drift, and unforeseen issues, which disrupt our operations.”

The community Istio support timeline currently follows an N-1 support status, which equates to upgrading Istio approximately every 7-9 months. Upgrading Istio can often be highly involved. Eliminating the sidecar greatly reduces the risk of disrupting business operations, simply because there are fewer components to upgrade and manage.

By focusing on node-level management rather than service-level sidecars, operations teams can perform upgrades more efficiently, and with greater confidence without needing to rely on or coordinate with application teams. It is worth considering that operational cost savings can really add up. For example, if the time required for two engineers to complete an Istio upgrade is reduced from 16 hours to just 2 hours, the savings in both time and cost can be substantial.
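A quick sketch of the arithmetic behind that example, assuming the 16-hour and 2-hour figures are per engineer (the text does not say whether they are per person or combined). The percentage reduction is the same under either reading.

```python
def engineer_hours(engineers: int, hours_each: int) -> int:
    """Total effort for one upgrade, in engineer-hours."""
    return engineers * hours_each


# Figures from the example above; the per-engineer interpretation is an assumption.
before = engineer_hours(engineers=2, hours_each=16)
after = engineer_hours(engineers=2, hours_each=2)

saved = before - after
reduction = saved / before

print(f"{saved} engineer-hours saved per upgrade ({reduction:.1%} less effort)")
```

Multiplying the saved hours by your own loaded engineering rate and upgrade frequency gives a rough annual figure for your organization.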

Scenario 2: Limited talent and resourcing available for sidecar management

“Onboarding new development teams to use the service mesh is time-consuming and requires extensive training on sidecar management.”

Developers frequently express frustration with managing sidecars, as evidenced by numerous community surveys conducted over the years. Istio’s ambient mode addresses this issue head-on; by eliminating sidecars entirely, there is no need for extensive training on sidecar management. Users of the sidecarless service mesh can trust that if an application is deployed on the cluster, its traffic is inherently encrypted by default.

Scenario 3: Integrating legacy applications causes disruptions

“We have a significant number of legacy applications that are critical to our business operations. Introducing a sidecar-based service mesh disrupts these applications and requires extensive modifications.”

With Istio’s ambient mode, adding a service to the mesh only requires a label applied to the namespace or pod. Traffic is then intercepted by ztunnel without restarting the application. This greatly simplifies onboarding critical applications that have a low tolerance for disruption or modification, and that previously were not considered strong candidates for the mesh. Application owners no longer need to concern themselves with the presence of a sidecar in their workload, the lifecycle of that sidecar, or the cost of the sidecar resource.
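For reference, the label in question is `istio.io/dataplane-mode: ambient`. The sketch below builds a JSON merge patch that adds it to a namespace; the helper name is hypothetical, and most users would simply run `kubectl label namespace <name> istio.io/dataplane-mode=ambient` instead.

```python
import json

# The label Istio uses to enroll a namespace (or pod) in ambient mode.
AMBIENT_LABEL = {"istio.io/dataplane-mode": "ambient"}


def ambient_namespace_patch() -> str:
    """Return a JSON merge patch that adds the ambient label to a namespace."""
    return json.dumps({"metadata": {"labels": AMBIENT_LABEL}})


# The output could be passed to: kubectl patch namespace <name> --type merge -p '<patch>'
print(ambient_namespace_patch())
```

Note that nothing in the patch touches the pod template, which is why no application restart is involved.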

Comparing the Usage Cost of Ambient Mode

In order to evaluate the resource usage of service mesh, we will deploy a sample application at scale. This is representative of how a user might actually use service mesh in a production environment, and is based on our discussions with our customers about how they are using Istio and Gloo Mesh today.

Our test workloads represent an application configured in a Namespace-Per-Tenant pattern, where each tenant operates in an isolated namespace to ensure resource and security separation. The application is designed with a classic 3-tier fan-out architecture, where an initial service sends requests to multiple downstream services, providing a more representative assessment of expected performance compared to a naive benchmark where a client targets a single service.

We have 8 containers deployed per tenant, grouped into two “apps” with four pods each arranged in three tiers. With 25 tenants in the cluster, we have a total of 200 microservices.

To ensure a “Guaranteed” quality of service class, every application is configured to have the same pod requests and limits: 0.7 CPU cores and 500MB of memory. (The application we are using is fake-service, a sample application for testing service mesh built by CNCF TAG Network co-chair Nic Jackson.)

Reflecting the fact that users tend to run their clusters at medium utilization, we set a target of < 30% CPU utilization for our baseline application under load, which worked out to a 21-node cluster with n2-standard-8 instances on Google Cloud.

To generate synthetic load we are using Vegeta, with one instance per app (two in each namespace). Four more nodes were utilized for the load generators, sending 200 requests per second to each of the Tier 1 replicas. These requests generate subsequent requests to Tier 2 and Tier 3 along the edges shown, giving a load of 2000 RPS through each tenant and 50000 RPS total across the cluster. We ensure that the latency results we see in our tests are within our defined expectations, with no requests taking more than 10ms at p50 or 15ms at p95.
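The test setup can be sanity-checked with a few lines of arithmetic. This sketch only restates figures from the description above, and assumes one pod per workload (the replica counts are not spelled out in the text, so the requested totals would be higher if there are more).

```python
# Figures taken from the test description above.
TENANTS = 25
WORKLOADS_PER_TENANT = 8
CPU_REQUEST_MILLICORES = 700  # 0.7 cores, kept in millicores so the math stays exact
MEMORY_REQUEST_MB = 500
APP_NODES = 21
VCPUS_PER_NODE = 8  # n2-standard-8
RPS_PER_TENANT = 2000

# Assumption (not stated in the text): one pod per workload.
workloads = TENANTS * WORKLOADS_PER_TENANT
requested_cores = workloads * CPU_REQUEST_MILLICORES / 1000
requested_memory_gb = workloads * MEMORY_REQUEST_MB / 1000
cluster_vcpus = APP_NODES * VCPUS_PER_NODE
cluster_rps = TENANTS * RPS_PER_TENANT

print(f"workloads: {workloads}")
print(f"requested: {requested_cores} cores, {requested_memory_gb} GB")
print(f"application node capacity: {cluster_vcpus} vCPUs")
print(f"cluster-wide load: {cluster_rps} RPS")
```

The per-tenant figure of 2000 RPS depends on the fan-out between tiers, so it is taken as given here rather than derived.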

We took a baseline reading of our application, which we can subtract from the mesh numbers to learn the cost of each option.

Then, we deployed our application with Istio in both sidecar mode and ambient mode.

You can find our scripts and our outputs on GitHub.

Istio’s Ambient Mode is over 70% Cheaper than Sidecars

Aside from all the operational savings we laid out above, Istio in ambient mode is substantially cheaper to run than Istio in sidecar mode.

Even in a heavily loaded, well-sized cluster, ambient mode at L4 uses 73% less CPU than sidecar mode. In our example cluster, Istio in ambient mode requires 1.28 fewer cores per namespace. Given our 25-namespace environment, that equates to a 32-core saving, or the equivalent of four 8-core machines. That’s a saving of $1,100 per month!
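The savings figures above follow from simple multiplication, sketched here. The per-machine cost at the end is only what the quoted monthly saving implies when divided across the four machines; it is not a published price.

```python
# Figures quoted above.
CORES_SAVED_PER_NAMESPACE_MILLI = 1280  # 1.28 cores, in millicores to keep the math exact
NAMESPACES = 25
CORES_PER_MACHINE = 8
MONTHLY_SAVING_USD = 1100

cores_saved = CORES_SAVED_PER_NAMESPACE_MILLI * NAMESPACES / 1000
machines_saved = cores_saved / CORES_PER_MACHINE

# Derived, not quoted: the monthly cost per machine implied by the figures above.
implied_cost_per_machine = MONTHLY_SAVING_USD / machines_saved

print(f"{cores_saved} cores saved, equivalent to {machines_saved} machines")
print(f"implied cost per machine: ${implied_cost_per_machine} per month")
```

Substituting your own namespace count and instance pricing gives an estimate for a differently sized cluster, assuming the per-namespace saving carries over to your workloads.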

What’s more, the CPU utilization is minimal: a mere 4.78% extra CPU to add mTLS to our workload, compared with 24.3% extra to add sidecars. Assuming you have at least 5% CPU headroom on all the nodes in your cluster, you can install Istio in ambient mode without adding any extra nodes, thus making it effectively free to run.
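That headroom rule of thumb can be expressed as a simple check. This is a rough sketch that assumes the measured 4.78% overhead applies evenly to every node, which may not hold for clusters with uneven traffic; the node names and free-CPU fractions are hypothetical.

```python
AMBIENT_MTLS_CPU_OVERHEAD = 0.0478  # 4.78% extra CPU measured in our test


def fits_without_new_nodes(free_cpu_by_node: dict[str, float]) -> bool:
    """True if every node has enough free CPU (as a fraction) to absorb the overhead."""
    return all(free >= AMBIENT_MTLS_CPU_OVERHEAD for free in free_cpu_by_node.values())


# Hypothetical clusters: free CPU per node, as a fraction of node capacity.
roomy_cluster = {"node-a": 0.30, "node-b": 0.12, "node-c": 0.05}
tight_cluster = {"node-a": 0.30, "node-b": 0.03}

print(fits_without_new_nodes(roomy_cluster))
print(fits_without_new_nodes(tight_cluster))
```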

Users looking to implement full L7 functionality and features with Istio’s ambient mode can do so with confidence. You can opt in per-namespace where needed, but even when we ran a waypoint for every tenant (comparable in features to a full sidecar deployment), we still saw substantial CPU savings. Keep an eye out for our upcoming blog that discusses the cost and value of L7 in more depth.

In conclusion

Our research validates the goals in building the ambient data plane: simplifying operations of the service mesh (no sidecars), as well as reducing infrastructure costs (substantial cost savings vs. sidecars, minimal additional cost to fulfill mTLS requirement from baseline).

If users decide to adopt the full L7 feature set by adopting waypoint proxies, the resourcing cost is substantially less than traditional sidecar deployments of Istio, not to mention being easier to manage and improving the experience for developers.

Istio’s ambient mode enables users to adopt a sidecarless architecture and run a service mesh in a truly ‘ambient’ manner for developers. You can learn more about how to extend the functionality of Istio’s ambient mode for enterprise architecture and workloads in the latest 2.6 release of Gloo Mesh.

As a co-creator and leader in the development of the Istio ambient data plane, Solo.io is uniquely positioned to help our customers adopt this architecture for production-grade security and compliance requirements. To find out more about how ambient mode can optimize your application services and connectivity, please reach out and connect with us.
