Observability is one of the key problems that service meshes like Istio can solve easily.
If you are using Istio, all of your applications will produce standardized metrics — regardless of the technology in use — to meet the requirements of the RED (rate, errors, duration) method.
These metrics are essential for operating business-critical workloads in production, but they can lead to other challenges such as storage and collection at scale, cardinality explosion, and operating actual Prometheus instances to make these metrics queryable.
Since Gloo Platform can be an orchestrator of one or more service meshes, having a scalable telemetry pipeline is crucial to address these challenges.
The platform also has a Graph, where you can understand how the applications depend on each other and quickly identify performance degradations.
Challenges
Initially, our Platform wasn’t driven by OpenTelemetry.
Take a look at the architecture diagrams to understand where we were coming from.
As you can see, originally our Gloo Agent was responsible for both collecting metrics in the workload clusters and forwarding these to the management cluster.
This approach had two main issues.
Issue #1: Too Many Responsibilities
Gloo Agents were already over-employed, with two jobs besides collecting metrics:
- Resource discovery and sending this information to the management cluster
- Applying resources in the workload clusters
These were already full-time jobs by themselves, so putting more responsibilities on the Agents led to scalability issues in the pipeline.
Issue #2: Lack of Control
The second issue was not having the ability to transform, filter, and integrate this telemetry data.
The Agents didn’t know how to create new labels or how to filter and drop metrics, and pushing the data to multiple locations (long-term storage, SaaS observability tools, etc.) was troublesome.
Originally, all the scraped metrics were shipped to the management cluster, where a Prometheus instance scraped the management server, the component that exposed all of these metrics.
Adding and removing scrape targets was not an easy task (forget about Prometheus-style scrape configs), and since everything (we are talking about tens, or even hundreds, of Kubernetes clusters with Istio on top of them) was pushed to and exposed from a single destination, Prometheus often struggled to perform well.
It was clear that the architecture needed to be revisited to solve these limitations.
Why OpenTelemetry, and what does the new pipeline look like?
After investigating the aforementioned issues, we realized that they could only be resolved by having a dedicated component for the observability tasks.
Leveraging something like Thanos could also be an option, but it’s always better to keep things simple and focus on the business’s core challenges. Storing metrics for years is not the kind of business Solo.io is in.
We could either offload these tasks from the Agents and build a new telemetry component from scratch, or leverage an existing tool, if such a thing existed.
Fortunately, there’s one that’s built for this exact purpose, and it’s called OpenTelemetry (OTel, from now on) Collector.
With OTel in place, this is what our new pipeline looks like:
How does it work?
Default pipeline
We have a default pipeline with a single purpose: collect all the relevant metrics for our Graph in the workload clusters and ship them to the management cluster.
Then, in the management cluster our Prometheus can scrape these metrics, making them available for our UI.
This is done by running the OTel collectors as DaemonSets on the nodes of the workload clusters. These scrape all the interesting metrics targets, including Istio-injected workloads, istiod, Cilium components, and the collector itself.
The collectors then apply filters to get rid of all the metrics and labels that we don’t need for our UI; this is what we call the Minimum Metrics Set. Finally, they push these metrics to the management cluster via an otlp exporter.
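To make this concrete, here is a minimal sketch of what a workload-cluster collector configuration along these lines could look like: a prometheus receiver for scraping, the filter processor to approximate a Minimum Metrics Set, and an otlp exporter pointing at the management cluster. The scrape job, metric names, endpoint, and service address below are placeholders, not Gloo Platform’s actual generated configuration.

```yaml
# Illustrative workload-cluster collector config (placeholder names and endpoints).
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: istio-proxies              # scrape Envoy sidecar metrics
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_container_name]
              regex: istio-proxy
              action: keep

processors:
  filter/min-metrics:                          # keep only what the Graph needs
    metrics:
      include:
        match_type: regexp
        metric_names:
          - istio_requests_total
          - istio_request_duration_milliseconds.*
  batch: {}

exporters:
  otlp:
    endpoint: gloo-telemetry-gateway.gloo-mesh.svc:4317   # example gateway address
    tls:
      insecure: true                           # for illustration only

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [filter/min-metrics, batch]
      exporters: [otlp]
```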
On the other side, we have the Gloo Telemetry Gateway, which is again a collector in disguise, just configured differently (e.g. it’s a Deployment). It has an otlp receiver as its input (notice that this is the output of the collectors in the workload clusters) and then exposes these metrics to our Prometheus.
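A sketch of the gateway side could look like the following, again with placeholder ports: an otlp receiver accepts what the workload-cluster collectors send, and a prometheus exporter (from the contrib distribution) exposes the metrics for the management-cluster Prometheus to scrape.

```yaml
# Illustrative gateway-side collector config (placeholder ports).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317                 # receives the workload-cluster collectors' output

exporters:
  prometheus:
    endpoint: 0.0.0.0:9091                     # scrape target for the management-cluster Prometheus

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```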
Extending the pipeline
Having the default pipeline to drive our Graph is nice, since we have a lot more control than we had before, but now that we have OpenTelemetry in our stack, we cannot stop here, can we?
One of the other benefits of having OTel is its vibrant ecosystem. You can take a look at all the various receivers, processors, and exporters in the contrib repository, and you will probably find what you are looking for.
Once you have all the LEGO pieces, you can compose them into pipelines to power other tools as well. Let’s imagine you want to drive our UI, but you also want to push everything to long-term storage such as Thanos, or to a SaaS provider like Datadog or New Relic. With the help of the pipelines, you can easily achieve this, as the sketch below shows. Your security team also needs your logs after some transformation for their SIEM system? Not an issue!
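As an illustration, here is a hypothetical extension of the earlier workload-cluster sketch: the same scraped metrics feed two pipelines, one keeping the filtered set for the Graph and one fanning everything out to Thanos and Datadog via contrib exporters. All endpoints, keys, and pipeline names are placeholders.

```yaml
# Hypothetical fan-out config; receivers and processors abbreviated from the earlier sketch.
receivers:
  prometheus: {}                               # scrape_configs as in the earlier sketch

processors:
  filter/min-metrics: {}                       # Minimum Metrics Set filter, as before
  batch: {}

exporters:
  otlp:
    endpoint: gloo-telemetry-gateway.gloo-mesh.svc:4317
    tls:
      insecure: true
  prometheusremotewrite/thanos:
    endpoint: https://thanos-receive.example.com/api/v1/receive
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    metrics/ui:                                # filtered metrics, on to the gateway as before
      receivers: [prometheus]
      processors: [filter/min-metrics, batch]
      exporters: [otlp]
    metrics/external:                          # everything, unfiltered, to Thanos and Datadog
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite/thanos, datadog]
```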
Conclusion & future ideas
As you can see, introducing OpenTelemetry into our stack has opened the doors to a way more flexible and scalable telemetry solution with an extensive ecosystem used by thousands of engineers every day.
We are just getting started! Be on the lookout for new features such as our new Portal Analytics powered by ClickHouse, or enriching existing metrics with cloud provider metadata debuting in Gloo Platform 2.4.0.