Service mesh

Why service mesh, how it works, and 6 tools you should know

What is a service mesh?

A service mesh is a networking framework for adding observability, security, and reliability to distributed applications. It does this by providing these functions at the platform layer rather than embedding them in the application layer.

Technically, a service mesh is a set of lightweight network proxies, typically deployed alongside application code in a “sidecar container”. These proxies form the data plane of the service mesh, which is controlled and configured by a control plane. The proxies perform several important functions (a minimal sketch of the idea follows the list below), including:

  • Handling communication between microservices
  • Service discovery
  • Load balancing
  • Authentication and authorization
  • Observability
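
To make the data plane concrete, the following is a minimal sketch of the core idea behind a sidecar proxy: a process that accepts traffic on a service's behalf and forwards it to the application. This is not any particular mesh's implementation, and the port numbers are hypothetical; a real mesh proxy would also apply the mTLS, telemetry, and policy functions listed above at this interception point.

```go
package main

import (
	"io"
	"log"
	"net"
)

// A minimal sidecar-style TCP proxy: accept connections on behalf of the
// application and shuttle bytes to it. A real mesh proxy would also
// terminate mTLS, emit telemetry, and enforce policy here.
func main() {
	const listenAddr = ":15001"           // hypothetical proxy port
	const upstreamAddr = "127.0.0.1:8080" // hypothetical application port

	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			up, err := net.Dial("tcp", upstreamAddr)
			if err != nil {
				log.Print(err)
				return
			}
			defer up.Close()
			go io.Copy(up, c) // client -> application
			io.Copy(c, up)    // application -> client
		}(conn)
	}
}
```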

The use of a service mesh is not limited to the cloud native world. However, applications based on containerized or serverless architectures have a much greater need for one.

Cloud native applications are typically decomposed into dozens or hundreds of microservices. Each service can have multiple instances, each dynamically scheduled by an orchestrator like Kubernetes. This increased complexity means that communication between services is difficult to manage, and also critical for basic application functions. A service mesh ensures services can communicate reliably, securely, and with high performance.

Explore how service mesh supports zero trust

Why is a service mesh important?

Service mesh adoption is rapidly growing. According to a recent CNCF microsurvey, 60% of organizations in the cloud native community are using a service mesh in production, 10% are using it in development, and another 19% are evaluating one.

In general, the more an organization relies on a microservices architecture to build software, and the more microservices each of its applications includes, the more it can benefit from a service mesh.

Here are key benefits of a service mesh:

  • Managing complexity—as a microservices application grows and loads increase, interaction between microservices grows exponentially. Advanced routing is needed to optimize data flow between services, ensure high performance and availability, and provide metrics on the behavior of microservices.
  • Enabling secure communication—it is important to secure inter-service communication so that if attackers compromise one service, they cannot move laterally to the rest of the application. Service mesh technologies enable mutual transport layer security (mTLS) connections between services (see the sketch after this list).
  • Infrastructure as code—modern DevOps teams manage continuous integration / continuous delivery (CI/CD) pipelines by automatically deploying applications and infrastructure as code within Kubernetes clusters. Service mesh provides the critical capability to manage network and security policies through code.
  • Abstracting communication from the application—a service mesh manages the communication layer, allowing developers to focus on the business logic of their microservices. Otherwise, each microservice would have to handle concerns like communication, authentication, and load balancing.
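
To make the mTLS benefit concrete, here is a rough sketch, in Go, of the mutual TLS setup a service would otherwise have to configure itself; a mesh issues, distributes, and rotates these certificates for every workload automatically. The certificate file names are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

// Hand-configured mutual TLS, which a service mesh automates: the server
// presents its own certificate and requires a verified client certificate.
func main() {
	// Trust anchor for client certificates (hypothetical file name).
	caPEM, err := os.ReadFile("ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	srv := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			ClientAuth: tls.RequireAndVerifyClientCert, // enforce mTLS
			ClientCAs:  pool,
		},
	}
	// cert.pem/key.pem identify this service; a mesh would issue them.
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
```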

Service mesh architecture

In a service mesh, each service has a proxy running alongside it as a sidecar. These proxies, together with the agents that configure them, make up the data plane and control plane, which provide service management via a unified system. A request between services passes through two proxies within the mesh: the calling service's proxy and the receiving service's proxy.

The service mesh architecture abstracts all service functions unrelated to business logic. The data plane carries traffic between agents and services, while the control plane enforces configurations and policies on the data plane.

The control plane allows the proxies to fulfill several functions in the service mesh:

  • Service discovery—when instances must interact with other services, they must locate and retrieve available, healthy instances of these services. Instances often do this using a DNS lookup. Container orchestration frameworks usually maintain a list of instances that can receive requests, providing a DNS query interface.
  • Load balancing—orchestration frameworks typically offer load balancing at the transport layer (Layer 4). A service mesh adds advanced load balancing at the application layer (Layer 7), with richer traffic management rules and algorithms. Its API lets admins adjust load balancing parameters, for example to tune a canary deployment (see the sketch after this list).
  • Verification—a service mesh can authorize and authenticate requests within and outside the application, only sending authenticated requests to each instance.
  • Service monitoring—service meshes can provide insight into the health and behavior of services. The control plane collects and aggregates telemetry data about interactions between components to determine health, including access logs, traffic volumes, and latency. Tools like Grafana, Prometheus, and Elasticsearch, alongside third-party integrations, support further visualization and monitoring.
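
To ground the first two functions, here is a sketch of DNS-based service discovery with naive round-robin load balancing: roughly what a mesh proxy does, but without the health checking and richer algorithms a real proxy adds. The service name is hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"sync/atomic"
)

var next uint64 // round-robin counter shared across requests

// pickInstance resolves a service name to its instance IPs via DNS and
// selects one round-robin, as a headless Kubernetes service would allow.
func pickInstance(service string) (string, error) {
	addrs, err := net.LookupHost(service)
	if err != nil {
		return "", err
	}
	if len(addrs) == 0 {
		return "", fmt.Errorf("no instances found for %s", service)
	}
	i := atomic.AddUint64(&next, 1)
	return addrs[i%uint64(len(addrs))], nil
}

func main() {
	// Hypothetical headless service name inside a Kubernetes cluster.
	addr, err := pickInstance("orders.default.svc.cluster.local")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("routing request to", addr)
}
```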

The control plane handles the following:

  • Service registration—the control plane requires a list of endpoints and services available to each service proxy. It queries the system that schedules the underlying infrastructure (e.g., Kubernetes) to compile this list and register the available services (a rough sketch follows this list).
  • Sidecar proxy configuration—includes configuring policies across the network, which the proxies must know to operate properly.
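
The registration step can be sketched with the Kubernetes Go client (client-go): list the Endpoints objects in a namespace and collect instance addresses, which a control plane would then push to the proxies. This is a rough sketch rather than any mesh's actual implementation; it assumes in-cluster credentials and the default namespace.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Compile a service registry the way a control plane might: ask the
// Kubernetes API for every service's endpoints in a namespace.
func main() {
	cfg, err := rest.InClusterConfig() // assumes we run inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	eps, err := client.CoreV1().Endpoints("default").
		List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, ep := range eps.Items {
		for _, subset := range ep.Subsets {
			for _, addr := range subset.Addresses {
				fmt.Printf("service %s -> %s\n", ep.Name, addr.IP)
			}
		}
	}
}
```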

Learn more in our detailed guide to service mesh architecture

Service mesh pros and cons

Service meshes solve some (but not all) of the key problems of managing communication between services. A service mesh provides the following benefits:

  • It simplifies inter-service communication in container-based and microservices architectures.
  • Because communication is handled at the infrastructure layer, communication errors are easier to diagnose.
  • A service mesh supports security features like authentication, authorization, and encryption.
  • It accelerates application development, testing, and deployment.
  • Sidecars located next to container clusters can effectively manage network services.

However, a service mesh also has the following disadvantages:

  • Using a service mesh increases the number of runtime instances.
  • All service calls must go through a sidecar proxy, adding an extra step to the communication process.
  • A service mesh does not always support integration with other systems or services, and it does not handle message transformation or routing the way an API gateway does.

The complexity of network management persists despite being abstracted and centralized. A human team still needs to manage configurations and integrate the service mesh into existing workflows.

Service mesh vs. API gateway

An API gateway is a service that accepts incoming API requests from clients, routes the requests to the appropriate application service, processes the service’s response, and relays the response to the requesting client. API gateways mainly focus on externally initiated requests and manage client-to-server communication.

A service mesh handles internal requests that microservices send to other microservices in the application. It primarily manages service-to-service communication.

API gateways are often easier to deploy and manage because they only need to be deployed once in a software stack, and provide simple centralized monitoring. A service mesh must be integrated with all application services, typically by running it alongside services in a sidecar container.

Most microservices applications can benefit from using both an API gateway and a service mesh. The two systems can work together, as sketched below—the API gateway receives end-user requests and forwards them to specific microservices. When that microservice needs to communicate with other microservices, it uses the service mesh.
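
As an illustration of this division of labor, a minimal API gateway can be sketched as a reverse proxy that maps external URL paths to internal services; once a request reaches a service, any further service-to-service hops would travel through the mesh's sidecars. The internal service addresses here are hypothetical.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// proxyTo returns a handler that forwards requests to an internal service.
func proxyTo(rawURL string) http.Handler {
	target, err := url.Parse(rawURL) // hypothetical internal address
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(target)
}

// A minimal API gateway: route external paths to internal services.
// Hops between those services would then go through the service mesh.
func main() {
	mux := http.NewServeMux()
	mux.Handle("/orders/", proxyTo("http://orders.internal:8080"))
	mux.Handle("/users/", proxyTo("http://users.internal:8080"))
	log.Fatal(http.ListenAndServe(":80", mux))
}
```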

Learn more in our detailed guide to service mesh vs. API gateway

6 service mesh choices for Kubernetes

Here are some common technologies for implementing a service mesh. They include commercial and open source offerings with vendor-provided enterprise support.

Related content: Read our guide to service mesh for Kubernetes

Istio

License: Apache License 2.0
Repository:
https://github.com/istio/istio

Originally created by Google, IBM, and Lyft, Istio is a Kubernetes-native solution offering additional features like deep analytics. It has support from several major tech companies, including Google, IBM, and Microsoft—it is these companies’ default service mesh option for Kubernetes.

Istio separates the data plane from the control plane, which runs as a pod in the Kubernetes cluster. The proxies cache configuration data, so the mesh remains highly resilient to pod failures wherever they occur in the service mesh.

Istio offers the following capabilities:

  • Security—helps application teams implement a zero trust strategy by defining and enforcing authentication, authorization, and access control policies. Istio encrypts all communication between services, both within and outside the cluster, using mutual TLS (mTLS). It also supports JSON Web Tokens (JWTs) to authenticate requests from external and internal users.
  • Resilience—eliminates the need to code circuit breakers into the application. Istio lets platform architects specify resilience mechanisms without the application’s knowledge, including timeouts for each service, retries, and automatic failover for high-availability systems (the sketch after this list shows what such logic looks like when hand-coded).
  • Visibility—tracks network requests and provides a trail for all calls across different services. Istio provides telemetry data like latency, traffic health, errors, and saturation, helping SREs understand the behavior of each service and troubleshoot and tune applications.
  • Advanced deployments—Istio provides granular visibility and network controls for workloads such as containerized and VM-based deployments. Istio facilitates blue-green and canary deployments by routing user groups to new applications.
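
For contrast, the sketch below shows the kind of timeout-and-retry code that would otherwise live in every service. With a mesh like Istio, equivalent behavior is configured in the proxy by the platform team instead of being coded by developers. The URL, timeout, and retry budget here are hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// getWithRetries is hand-rolled resilience logic that a mesh can replace
// with proxy-level configuration: a per-request timeout, bounded retries,
// and a simple linear backoff between attempts.
func getWithRetries(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second} // hypothetical timeout
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or an error the caller should see
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // backoff
	}
	return nil, lastErr
}

func main() {
	resp, err := getWithRetries("http://orders.internal:8080/health", 3)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```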

Learn more in our detailed guide to Istio

Linkerd

License: Apache License 2.0
Repository:
https://github.com/linkerd/linkerd2

Linkerd is probably the second most commonly used service mesh for Kubernetes. From v2 onward, its architecture is similar to Istio’s. Linkerd prioritizes simplicity over flexibility, and v2 is Kubernetes-specific, which reduces the number of moving parts and overall complexity. Linkerd v1.x supports additional container platforms and is still supported, but it lacks the new features introduced for Kubernetes in v2.

Linkerd offers the following features:

  • Authorization policies—restrict the types of traffic allowed within the network.
  • Automated mTLS—automatically implements mutual Transport Layer Security across all inter-app communication.
  • Automated proxy injection—Linkerd automatically injects the data plane proxy into pods based on annotations.
  • Fault injection—provides a mechanism to inject faults into services.
  • HTTP and gRPC proxies—Linkerd automatically enables advanced features like retries, metrics, and load balancing for HTTP, HTTP/2, and gRPC connections.
  • Complete on-cluster metrics—Linkerd offers a full metrics stack with dashboards and CLI tools.
  • Retry and timeout policies—perform retries and timeouts specific to a service.
  • TCP proxy and protocol detection—Linkerd can proxy all TCP-based traffic, including TLS connections, HTTP tunneling, and WebSockets.

Learn more in our detailed guide to Linkerd

Consul Connect

License: Mozilla Public License 2.0
Repository:
https://github.com/hashicorp/consul

Connect is an addition to HashiCorp’s Consul framework for managing services. While initially intended for service management on Nomad, Consul now supports several container platforms, including Kubernetes. Consul Connect builds on Consul’s service discovery capabilities to provide complete service mesh functionality. An agent installed on every node as a DaemonSet communicates with the Envoy sidecar proxies, which route and forward traffic.

Connect offers these capabilities:

  • Advanced Kubernetes support—provides a Helm chart to automatically install, configure, and upgrade Consul as a service mesh on Kubernetes.
  • Platform-agnostic—Consul is compatible with all cloud providers and architectures.
  • Multi-cluster—features like service catalog synchronization and auto-joining can extend the boundaries of Kubernetes clusters and include external services.
  • mTLS—the control plane supports a service mesh configuration that enforces mutual TLS, automatically generating and distributing TLS certificates for each service within the mesh.

Traefik

License: MIT License
Repository:
https://github.com/traefik/traefik

Traefik Mesh (formerly Maesh) is an easily configured service mesh, built around the Traefik proxy, that increases traffic observability and manageability in Kubernetes clusters. It offers advanced traffic management capabilities, including circuit breaking and rate limiting.

Other features include:

  • Open source load balancer and reverse proxy—Traefik replaces the standard Envoy sidecar proxy used in a service mesh. It supports several load balancing algorithms.
  • SMI support—Service Mesh Interface (SMI), an industry standard for service mesh implementations.
  • Automated updates—Traefik continuously updates the configuration without requiring restarts.
  • Wildcard encryption certificate support—enables HTTPS to microservices.
  • Single-file—Traefik is available as one binary file in a small Docker image.

NGINX Service Mesh

License: Primarily open source
Repository:
https://github.com/nginx

NGINX Service Mesh offers secure, scalable ingress and egress traffic management. It works for small Kubernetes clusters and large deployments alike. NGINX Plus acts as the sidecar proxy, alongside the proprietary ingress controller for Kubernetes. NGINX Service Mesh integrates with Grafana, OpenTracing, and Prometheus to provide observability. It offers several capabilities to help handle traffic, including:

  • Service throttling
  • Rate shaping
  • Canary deployments
  • A/B testing

AWS App Mesh

License: Apache License 2.0
Repository:
https://github.com/aws/aws-app-mesh-roadmap

App Mesh makes it easier to monitor and control services. It provides a dedicated layer to handle communication between services. Standardizing inter-service communication provides full application visibility, high availability, and network traffic control.

It is important to note that while AWS App Mesh is open source, it is designed for use only within the AWS ecosystem.

AWS App Mesh offers these features:

  • Traffic routing—configures services to connect directly rather than using load balancers or requiring code. When services start, their proxies connect to App Mesh to receive configuration information about other mesh services. It provides controls that dynamically update the traffic routing without changing the application code.
  • Inter-service authentication—uses mTLS to enable authentication at the transport layer, providing identity verification between services and application components. Customers can provision certificates to extend the security perimeter to all applications running in the mesh. The certificate authority helps enforce authentication for applications attempting to connect to a service.
  • Kubernetes-native user experience—App Mesh supports various AWS services and Kubernetes on EC2. Customers can include the proxy provided by App Mesh in the pod and task definition for containerized and microservice-based workloads. They can configure the application container for each service to communicate with the proxy. When a service starts, App Mesh automatically configures the proxy.
  • Fully managed service—App Mesh enables service communication management without installing or managing infrastructure at the application level.

How to choose and evaluate a service mesh solution

Determine the Need for a Service Mesh Architecture

Developers and IT architects usually know the basics of service mesh architectures. However, it is important to consider the expected use cases and appropriate implementation approach for a service mesh. They must assess the pros and cons of a service mesh and understand how it manages and enhances communication between services.

The underlying requirements determine the service mesh implementation. For instance, large enterprises use containerized service meshes for scalability. In a large-scale containerized deployment, service mesh can help:

  • Enforce TLS connections between containers.
  • Use load balancing policies to route traffic.
  • Monitor performance and identify problem areas using telemetry.

Although these tasks are possible in Kubernetes using add-on services such as application-layer load balancers (e.g., NGINX) and monitoring tools (e.g., Prometheus or Grafana), a service mesh handles them automatically, eliminating the management overhead.
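
As a point of comparison for the telemetry task above, the sketch below shows the kind of per-request instrumentation each service would need without a mesh, using the Prometheus Go client; a sidecar proxy emits equivalent request metrics with no application changes. The metric and label names are hypothetical.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Manual telemetry that a mesh sidecar would otherwise emit automatically:
// count requests by path and expose them for Prometheus to scrape.
var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "app_requests_total", // hypothetical metric name
		Help: "Requests received, labeled by URL path.",
	},
	[]string{"path"},
)

func main() {
	prometheus.MustRegister(requests)
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requests.WithLabelValues(r.URL.Path).Inc()
		w.Write([]byte("ok"))
	})
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```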

Choose a Base Service Mesh Platform

When an organization determines that a service mesh can provide benefits for its use case, it is time to select the base service mesh platform.

There are several popular open source technologies that can serve as an excellent basis for a service mesh deployment. We listed them above:

  • Istio
  • Linkerd
  • HashiCorp Consul Connect
  • Traefik
  • NGINX Service Mesh

Another approach, suitable for organizations mainly invested in one cloud platform, is to use a service mesh technology from their cloud provider, such as AWS App Mesh.

Choose Between Open Source, Commercial, or Managed

There are three main options for implementing a service mesh:

  • Using an open source service mesh like Istio or Linkerd as is, without support from a commercial vendor. This option saves on software license costs but requires more expertise and a larger implementation effort.
  • Using a commercial version of an open source service mesh platform, such as Gloo Mesh, Solo’s commercial offering based on Istio, or Buoyant Enterprise Support for Linkerd. This lets you capitalize on the strengths of an open source solution while gaining access to enterprise-grade tooling and support.
  • Using a managed service mesh such as AWS App Mesh in combination with Amazon services like Elastic Kubernetes Service (EKS). This reduces the learning curve and complexity of a service mesh, but also limits you to a specific cloud environment and service mesh solution.

Testing and Production Deployment

Service mesh testing can be complex, because platforms like Istio include many features and modules, and operate across large distributed systems. Service mesh projects are making an effort to simplify testing—for example, Istio provides the Istio Testing Framework, which simplifies test creation and execution through a set of code modules, and the Istio Lab GitHub project, which has a library of tests for various Istio features.

Implementing and deploying service mesh technology spans multiple disciplines, including application development and IT operations. Therefore, DevOps organizations will find it easier to deploy a service mesh solution. It is a good idea to have all elements of the DevOps team—including security teams—participate in the service mesh deployment.

If the organization currently does not use a full DevOps model, the adoption of a service mesh can be a good time to embrace cooperation between development and operations and the creation of cross-functional teams.

Service mesh with Solo.io

Solo.io provides Gloo Mesh, an enterprise service mesh based on Istio and Envoy and part of the integrated Gloo Platform. Gloo Mesh Enterprise delivers connectivity, security, observability, and reliability for Kubernetes, VMs, and microservices, spanning single-cluster, multi-cluster, and hybrid environments, plus production support for Istio.

According to the 2022 GigaOm Service Mesh Radar report, “Solo.io Gloo Mesh continues to be the leading Istio-based service mesh, incorporating built-in best practices for extensibility and security and simplified, centralized Istio and Envoy lifecycle management.”
