Gloo Mesh, The 100 Million Pod Mesh

Gloo Mesh’s ambient multi-cluster mode sets a new benchmark for scalability. In this test, we deploy 100 million pods across 2,000 clusters and show that it handles extreme scale with minimal resources, near-instant updates, and no manual tuning: effortless scalability and cost efficiency for enterprises.
February 3, 2025
John Howard

In the previous post, we introduced Gloo Mesh's new ambient multi-cluster mode that is simpler, safer, and more scalable than previous multi-cluster service mesh architectures.

However, we cannot be fully confident in how scalable our solution is without putting it to the test. We know it will scale, but just how far? In this post, we push the mesh further than ever before and deploy a single mesh spanning an industry-first 100 million pods!

Setup

For this test, we will deploy a fully interconnected multi-cluster mesh, where each cluster is connected to every other cluster. Gloo Mesh also supports asymmetric connectivity, which enables common topologies such as a hub-and-spoke model, but for this test we deliberately use the topology that is hardest to scale.
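
To see why a fully interconnected topology is the hardest to scale, a quick back-of-the-envelope comparison helps: in a full mesh every cluster maintains a link to every other cluster, while in a hub-and-spoke model each spoke connects only to the hub. The snippet below is purely illustrative and not part of the test harness.

  # Illustrative only, not part of the test harness: how many
  # cluster-to-cluster links each topology implies as the mesh grows.
  def full_mesh_links(clusters: int) -> int:
      # Every cluster connects to every other cluster (directional links).
      return clusters * (clusters - 1)

  def hub_and_spoke_links(clusters: int) -> int:
      # Every spoke connects only to the central hub.
      return clusters - 1

  for n in (2, 500, 2_000):
      print(f"{n:>5} clusters: full mesh = {full_mesh_links(n):>9,} links, "
            f"hub-and-spoke = {hub_and_spoke_links(n):>5,} links")

At 2,000 clusters a full mesh implies roughly 4 million directional links, versus about 2,000 for hub-and-spoke, which is why we treat it as the worst case.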

Each cluster consists of:

  • 1,000 services
    - 10% of services are exported for global consumption
  • 50,000 pods
    - Every second, a pod is replaced (one deleted, one added)
  • 1,000 nodes
  • 1 Istiod control plane pod — this can be scaled up, but a single replica is sufficient for the test.

To start, we will just run a single cluster, then gradually connect to an increasing number of clusters.

A single cluster on its own represents a sizeable mesh, yet it presents no problem at all, using roughly 10% of a CPU.

Next, we scale things up to a massive mesh size of 500 connected clusters, orders of magnitude beyond all but the largest real-world environments. The result is… boring!

With 25 million pods enrolled in the mesh, we see virtually no impact on CPU utilization of the control plane, and only a small increase in memory. Finally, we complete our scaling, for a grand total of:

  • 2,000 connected clusters
  • 100,000,000 pods
  • 2,000 pod changes per second
  • 2,000,000 nodes
  • Still 1 Istiod control plane pod per cluster
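
These totals follow directly from the per-cluster numbers listed in the setup; a quick sanity check in illustrative Python:

  # Quick sanity check: mesh-wide totals are just the per-cluster figures
  # multiplied by the number of connected clusters.
  clusters = 2_000
  pods_per_cluster = 50_000
  nodes_per_cluster = 1_000
  pod_changes_per_cluster_per_second = 1

  print(f"{clusters * pods_per_cluster:,} pods")                 # 100,000,000
  print(f"{clusters * nodes_per_cluster:,} nodes")               # 2,000,000
  print(f"{clusters * pod_changes_per_cluster_per_second:,} pod changes/second")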

Amazingly, the control plane is nearly unfazed, with resource usage that fits in the footprint of a Raspberry Pi.

Aside from resource utilization, another important metric is the time to propagate changes. This ensures workloads are not running on stale data, which could lead to sending traffic to pods that have been removed, for instance.

[Chart: Endpoint Propagation Time, showing a consistent 100ms maximum]

Throughout the test, the maximum propagation time for any change was observed to be nearly instant: less than 100ms (note: the underlying metric has a granularity of 100ms, which is why the line is perfectly flat). This means that even at the largest scales, our mesh is not only cheap to operate but incredibly stable as well!
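
To make the measurement concrete, the sketch below shows one way such a propagation time could be measured: take a timestamp when a change is written and another when an xDS client observes the updated endpoints. This is a simplified, hypothetical sketch, not the benchmark's actual instrumentation, and the two callables are placeholders.

  # Hypothetical sketch of measuring endpoint propagation time. The two
  # callables stand in for environment-specific operations (an API server
  # write and an xDS watch); they are not real library functions.
  import math
  import time

  def measure_propagation(write_change, wait_for_xds_update) -> float:
      start = time.monotonic()
      write_change()            # e.g. replace one Pod behind a Service
      wait_for_xds_update()     # block until a client sees the new endpoints
      return time.monotonic() - start

  def to_bucket(seconds: float, granularity: float = 0.1) -> float:
      # The benchmark's metric has 100ms granularity, so observed values
      # round up to the nearest 100ms bucket, producing the flat line.
      return granularity * math.ceil(seconds / granularity)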

All of this is achieved with out-of-the-box configuration for Istio. No tuning required, no manual toil like configuration scoping: it just works!

Testing Details

You might notice the final test environment has over 2 million nodes, which would retail for roughly a billion dollars a month. For obvious reasons, we did not literally spin up 2 million machines to run this test. Instead, we rely on a simulated environment.

A potential issue with any simulation is that it may not accurately reproduce a real-world environment, leading to artificially positive results. The simulation we use here, however, has been in use for 5+ years to measure Istio's performance, with substantial effort put into removing any variance between the real and simulated environments. The environment includes:

  • A real Kubernetes API server, etcd, and controller manager running.
  • Real objects written to the API server, such as Pod, Service, and Node.
  • Real xDS clients connecting to Istiod.
  • Changes to objects, ensuring they are not unrealistically static.

Essentially, the only difference between a real-world environment and the simulation is that Pod/Node objects exist only as objects in the API server, and do not have real containers/machines backing them. However, this difference has no impact on the Istio control plane.
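
For a sense of what this looks like in practice, here is a minimal, hypothetical sketch of the idea using the official Kubernetes Python client. The namespace, names, and single-pod churn loop are illustrative only, not the actual harness code.

  # Minimal, hypothetical sketch of the simulation approach: real Pod objects
  # are written to a real Kubernetes API server, but no kubelet or container
  # runtime ever runs them. A control plane watching the API server does the
  # same work either way.
  import time
  import uuid
  from kubernetes import client, config

  config.load_kube_config()
  core = client.CoreV1Api()
  NAMESPACE = "scale-test"  # hypothetical namespace

  def make_pod(name: str) -> client.V1Pod:
      return client.V1Pod(
          metadata=client.V1ObjectMeta(name=name, labels={"app": "simulated"}),
          spec=client.V1PodSpec(
              node_name="simulated-node-0",  # points at a simulated Node object
              containers=[client.V1Container(name="app", image="fake:latest")],
          ),
      )

  # Churn loop: every second, delete one simulated pod and create a new one,
  # mirroring the one-pod-per-second replacement rate used per cluster.
  current = f"sim-{uuid.uuid4().hex[:8]}"
  core.create_namespaced_pod(NAMESPACE, make_pod(current))
  while True:
      time.sleep(1)
      replacement = f"sim-{uuid.uuid4().hex[:8]}"
      core.create_namespaced_pod(NAMESPACE, make_pod(replacement))
      core.delete_namespaced_pod(current, NAMESPACE)
      current = replacement

Because the API server, etcd, and Istiod are all real, the control plane performs exactly the same work it would in production; only the kubelets and machines are missing.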

These simulated tests are routinely compared to real users' production environments and give similar results, giving us confidence in their accuracy. Similar approaches are used by a number of other projects as well.

What does this mean for me?

While you probably aren't planning to deploy a 100-million-pod mesh anytime soon, the scalability demonstrated here still benefits you: whatever your scale, Gloo Mesh can handle it with ease and at minimal cost!

Interested in getting started? Read the docs or contact us today.
