Monitor LLM usage with Gloo AI Gateway Consumption Reporting

Today, we'll discuss consumption reporting, a technique that allows you to extend the power of the built-in monitoring tools in Gloo AI Gateway. By implementing consumption reporting, you can track the details of LLM calls within your organization. Token usage can be monitored by AI provider and model, along with any custom labels relevant to your business.

What is consumption reporting?

Observability helps you understand how your system is performing, identify issues, and troubleshoot problems. Gloo AI Gateway provides a rich set of observability features that help you monitor and analyze the performance of your AI Gateway and the LLM providers that it interacts with.

While observability features like access logs are great for understanding individual requests, metrics are better for understanding the overall performance of your system. The default metrics in Gloo AI Gateway cover both the gateway and the LLM providers that it calls. In addition, you can add custom labels to these metrics to build a system of consumption reporting, which helps you understand the context of the requests, such as how many calls each team in your organization makes.

Why should you monitor token consumption?

There are several reasons why consumption reporting is valuable:

Cost management: Many LLM services, like OpenAI’s GPT models, charge based on the total number of tokens processed. Keeping an eye on token consumption helps you avoid unexpected costs, especially if you send large queries or run multiple agentic workflows.
Token limits: Many models have token limits, meaning there's a maximum number of tokens that can be processed in a single request (including both the prompt and the generated output). Monitoring helps you ensure your inputs don't exceed these limits, avoiding errors, truncated responses, or the need for splitting data into multiple requests.
Scalability: If you're building a system that scales with user interaction (e.g., chatbots, content generation, or data analysis), monitoring token usage helps you predict the system's overall resource needs. It also helps identify whether you're using the model efficiently, especially if you're handling thousands or millions of requests.

Set up consumption reporting with Solo.io’s Gloo AI Gateway

Now that you understand the benefits of consumption reporting, let’s walk through how to implement it using Solo.io’s Gloo AI Gateway.

Prerequisites

Before getting started, make sure you have the following:

Gloo Gateway Enterprise license key with an AI Gateway add-on: Contact a Solo.io account representative to obtain a Gloo Gateway Enterprise license key. Make sure you include the AI Gateway add-on in your license.
Gloo Gateway Enterprise installation: If you haven't installed Gloo Gateway Enterprise yet, follow the installation steps in the Solo.io Gloo Gateway docs.

Set up the Gloo AI Gateway

If you haven’t already, start by setting up your Gloo AI Gateway and authenticating the gateway with your AI provider. To get started, follow the setup steps in the Gloo AI Gateway docs.

Default metrics

Let’s start by taking a look at the default metrics that the system outputs, based on some simple requests to the AI provider.

First, get the external address for your AI gateway.

export INGRESS_GW_ADDRESS=$(kubectl get svc -n gloo-system gloo-proxy-ai-gateway -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS

For this demo, let's try asking our AI provider to write a poem.

curl -v "${INGRESS_GW_ADDRESS}:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | jq

You'll likely get a response similar to this one:

{
  "id": "chatcmpl-AEHdIbIIY5fRbeMwWlg30g086vPGp",
  "object": "chat.completion",
  "created": 1727967736,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept unique,\nLies recursion, magic technique.\nA function that calls itself, a loop profound,\nUnraveling mysteries, in cycles bound.\n\nThrough countless calls, it travels deep,\nInto the realms of logic, where secrets keep.\nLike a mirror reflecting its own reflection,\nRecursion dives into its own inception.\n\nBase cases like roots in soil below,\nPreventing infinite loops that can grow,\nBreaking the chain of repetition's snare,\nGuiding the function with utmost care.\n\nA recursive dance, elegant and precise,\nSolving problems with coding advice.\nInfinite possibilities, layers untold,\nRecursion, a story waiting to unfold.",
        "refusal": null
      },
      ...

Now we can check the default metrics that were collected for this request and response. In another tab in your terminal, port-forward the ai-gateway container of the gateway proxy.

kubectl port-forward -n gloo-system deploy/gloo-proxy-ai-gateway 9092

In the previous tab, run the following command to view the metrics.

curl localhost:9092/metrics

In the output, search for these metrics:

ai_completion_tokens_total
ai_prompt_tokens_total

These metrics total the number of tokens used in the prompt and completion for the openai model gpt-3.5-turbo.

# HELP ai_completion_tokens_total Completion tokens
# TYPE ai_completion_tokens_total counter
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo"} 539.0
...
# HELP ai_prompt_tokens_total Prompt tokens
# TYPE ai_prompt_tokens_total counter
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo"} 204.0

These metrics tell you how many tokens have been processed for all LLM API requests that have been routed through your AI Gateway. Additionally, these metrics can help you identify which providers and which models are being called to process requests.
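If you want a single total per provider and model rather than separate prompt and completion counters, a small filter can add them up. The following is a minimal sketch, not part of Gloo AI Gateway: `sum_tokens` is a hypothetical helper name, and it assumes label values contain no spaces, as in the sample output above. The sample input is copied from that output.

```shell
# Sum prompt + completion token counters per label set.
# Reads Prometheus text-format metrics on stdin.
# sum_tokens is a hypothetical helper, not a Gloo command.
sum_tokens() {
  awk '
    /^ai_(prompt|completion)_tokens_total[{]/ {
      labels = $1
      sub(/^[^{]*/, "", labels)   # drop the metric name, keep the label set
      total[labels] += $NF        # last field is the counter value
    }
    END { for (l in total) printf "%s %d\n", l, total[l] }
  ' | sort
}

# Sample input copied from the metrics output shown above.
sum_tokens <<'EOF'
# HELP ai_completion_tokens_total Completion tokens
# TYPE ai_completion_tokens_total counter
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo"} 539.0
# HELP ai_prompt_tokens_total Prompt tokens
# TYPE ai_prompt_tokens_total counter
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo"} 204.0
EOF
```

Against a live gateway, you could pipe `curl -s localhost:9092/metrics` into `sum_tokens` instead of the sample input.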

Custom metrics

In the previous step, you reviewed metrics for total token usage. These default metrics can help you ensure your organization doesn’t exceed limits that you want to implement across all teams and organizations.

But what if you need to review more specific usage? Default metrics are useful for gauging LLM usage over time, but don’t help you understand usage by each team. You can add that context by creating custom labels.

In this example, you track each team’s LLM provider usage by extracting claims from the JWTs of two users, Alice and Bob.

Authenticate users with JWTs

Start by creating a VirtualHostOption resource to define an inline JWT provider. The JWT provider is used to validate the JWTs that are sent as part of the requests to the Gloo AI Gateway. In this example, the JWT provider validates the JWT by using the public key that you add to the VirtualHostOption resource.

kubectl apply -f- <<EOF
apiVersion: gateway.solo.io/v1
kind: VirtualHostOption
metadata:
  name: jwt-provider
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: ai-gateway
  options:
    jwt:
      providers:
        selfminted:
          issuer: solo.io
          jwks:
            local:
              key: |
                -----BEGIN PUBLIC KEY-----
                MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAskFAGESgB22iOsGk/UgX
                BXTmMtd8R0vphvZ4RkXySOIra/vsg1UKay6aESBoZzeLX3MbBp5laQenjaYJ3U8P
                QLCcellbaiyUuE6+obPQVIa9GEJl37GQmZIMQj4y68KHZ4m2WbQVlZVIw/Uw52cw
                eGtitLMztiTnsve0xtgdUzV0TaynaQrRW7REF+PtLWitnvp9evweOrzHhQiPLcdm
                fxfxCbEJHa0LRyyYatCZETOeZgkOHlYSU0ziyMhHBqpDH1vzXrM573MQ5MtrKkWR
                T4ZQKuEe0Acyd2GhRg9ZAxNqs/gbb8bukDPXv4JnFLtWZ/7EooKbUC/QBKhQYAsK
                bQIDAQAB
                -----END PUBLIC KEY-----
EOF

Next, store JWTs for the example users Alice and Bob in environment variables. Each JWT includes a team claim with the name of the team that the user works on.

Alice works on the dev team:

export ALICE_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyAiaXNzIjogInNvbG8uaW8iLCAib3JnIjogInNvbG8uaW8iLCAic3ViIjogImFsaWNlIiwgInRlYW0iOiAiZGV2IiwgImxsbXMiOiB7ICJvcGVuYWkiOiBbICJncHQtMy41LXR1cmJvIiBdIH0gfQ.I7whTti0aDKxlILc5uLK9oo6TljGS6JUrjPVd6z1PxzucUa_cnuKkY0qj_wrkzyVN5djy4t2ggE1uBO8Llpwi-Ygru9hM84-1m53aO07JYFya1VTDsI25tCRG8rYhShDdAP5L935SIARta2QtHhrVcd1Ae7yfTDZ8G1DXLtjR2QelszCd2R8PioCQmqJ8PeKg4sURhu05GlBCZoXES9-rtPVbe6j3YLBTodJAvLHhyy3LgV_QbN7IiZ5qEywdKHoEF4D4aCUf_LqPp4NoqHXnGT4jLzWJEtZXHQ4sgRy_5T93NOLzWLdIjgMjGO_F0aVLwBzU-phykOVfcBPaMvetg

Bob works on the ops team:

export BOB_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyAiaXNzIjogInNvbG8uaW8iLCAib3JnIjogInNvbG8uaW8iLCAic3ViIjogImJvYiIsICJ0ZWFtIjogIm9wcyIsICJsbG1zIjogeyAibWlzdHJhbGFpIjogWyAibWlzdHJhbC1sYXJnZS1sYXRlc3QiIF0gfSB9.p7J2UFwnUJ6C7eXsFCSKb5b7ecWZ75JO4TUJHafjLv8jJ7GzKfJVk7ney19PYUrWrO4ntwnnK5_sY7yaLUBCJ3fv9pcoKyRtJTw1VMMTQsKkWFgvy-jEwc9M-D5lrUfR1HXGEUm6NBaj_Ja78XScPZb_-APPqMIvzDZU04vd6hna3UMc4DZE0wcnTjOqoND0GllHLupYTfgX0v9_AYJiKRAcJvol1W14dI7szpY5GFZtPqq0kl1g0sJPg-HQKwf7Cfvr_JLjkepNJ6A1lsrG8QbuUvMUAdaHzwLvF3L_G6VRjEte6okZpaq0g2urWpZgdNmPVN71Q_0WhyrJTr6SyQ
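To confirm which claims a token carries before you rely on them for labels, you can decode its payload locally. The following is a minimal sketch: `jwt_payload` and `SAMPLE_TOKEN` are hypothetical names, the sample reuses the header and payload of Alice's token with the signature left out, and the function assumes bash and a `base64` that accepts `-d` (GNU coreutils). It only decodes the payload for inspection and does not verify the signature.

```shell
# Print the decoded payload (second dot-separated segment) of a JWT.
# Inspection only: this does NOT verify the signature.
# jwt_payload is a hypothetical helper, not a Gloo command.
jwt_payload() {
  local seg
  # JWTs use base64url; map its alphabet back to standard base64
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding, so restore it before decoding
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
  echo
}

# Header and payload copied from Alice's token above; signature omitted.
SAMPLE_TOKEN="eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyAiaXNzIjogInNvbG8uaW8iLCAib3JnIjogInNvbG8uaW8iLCAic3ViIjogImFsaWNlIiwgInRlYW0iOiAiZGV2IiwgImxsbXMiOiB7ICJvcGVuYWkiOiBbICJncHQtMy41LXR1cmJvIiBdIH0gfQ.signature-omitted"

# Use the exported token if it is set, otherwise fall back to the sample.
jwt_payload "${ALICE_TOKEN:-$SAMPLE_TOKEN}"
```

The decoded JSON should include the team claim, which is the value that the custom label later in this post is built from.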

Make sure that the users can still access the AI API now that it is protected by the JWT provider. For example, send another poem request to the LLM, but this time include Alice's JWT in the Authorization header.

curl -v "${INGRESS_GW_ADDRESS}:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | jq

Because Alice's JWT is successfully validated, access to the AI API is granted. You should see a similar response to the one you received earlier.

{
  "id": "chatcmpl-AEHdIbIIY5fRbeMwWlg30g086vPGp",
  "object": "chat.completion",
  "created": 1727967736,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept unique,\nLies recursion, magic technique.\nA function that calls itself, a loop profound,\nUnraveling mysteries, in cycles bound.\n\nThrough countless calls, it travels deep,\nInto the realms of logic, where secrets keep.\nLike a mirror reflecting its own reflection,\nRecursion dives into its own inception.\n\nBase cases like roots in soil below,\nPreventing infinite loops that can grow,\nBreaking the chain of repetition's snare,\nGuiding the function with utmost care.\n\nA recursive dance, elegant and precise,\nSolving problems with coding advice.\nInfinite possibilities, layers untold,\nRecursion, a story waiting to unfold.",
        "refusal": null
      },
      ...

Track token usage based on team

Now, you can create custom metrics labels that help you track token usage by team.

To add custom labels to the metrics, update the GatewayParameters resource. In the stats.customLabels section, add a list of labels. Each entry specifies the name of the label and the dynamic metadata field that supplies its value.

kubectl apply -f- <<EOF
apiVersion: gateway.gloo.solo.io/v1alpha1
kind: GatewayParameters
metadata:
  name: gloo-gateway-override
  namespace: gloo-system
spec:
  kube:
    aiExtension:
      enabled: true
      stats:
        customLabels:
          - name: "team"
            metadataKey: "principal:team"
EOF

In this resource, the team label is sourced from the team field of the JWT. The metadata namespace defaults to the namespace where you defined the JWT provider, but you can specify a different namespace if you have a different source of metadata. When you apply this resource, the gateway proxy restarts to pick up the new stats configuration.

Let’s try sending some requests again to generate metrics data. Send one request with Alice’s token, and one with Bob’s token.

curl "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
"model": "gpt-3.5-turbo",
"messages": [
  {
    "role": "user",
    "content": "Please explain the movie Dr. Strangelove in 1 sentence."
  }
]
}'
curl "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $BOB_TOKEN" -H content-type:application/json -d '{
"model": "gpt-3.5-turbo",
"messages": [
  {
    "role": "user",
    "content": "Please explain the movie Dr. Strangelove in 1 sentence."
  }
]
}'

Because the gateway proxy restarted, the earlier port-forward session no longer works. In your port-forwarding tab, run the port-forward command again for the ai-gateway container of the gateway proxy.

kubectl port-forward -n gloo-system deploy/gloo-proxy-ai-gateway 9092

In the previous tab, check the metrics again.

curl localhost:9092/metrics

In the output, search for the same metrics that total the number of tokens used in the prompt and completion for the openai model gpt-3.5-turbo:

ai_completion_tokens_total
ai_prompt_tokens_total

This time, the metrics are broken out on a team-by-team basis. This way, you can easily check how many tokens Alice’s dev team is using, and how many Bob’s ops team is using!

# HELP ai_completion_tokens_total Completion tokens
# TYPE ai_completion_tokens_total counter
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo",team="dev"} 18.0
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo",team="ops"} 30.0
...
# HELP ai_prompt_tokens_total Prompt tokens
# TYPE ai_prompt_tokens_total counter
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo",team="dev"} 21.0
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo",team="ops"} 21.0
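For a quick per-team report, the labelled counters can be rolled up into one total per team. The following is a minimal sketch under the same assumptions as the metrics shown above: `tokens_by_team` is a hypothetical helper name, and it expects the `team` label that the GatewayParameters resource adds. The sample values are the ones from the output above.

```shell
# Total prompt + completion tokens per team label.
# Reads Prometheus text-format metrics on stdin.
# tokens_by_team is a hypothetical helper, not a Gloo command.
tokens_by_team() {
  awk '
    /^ai_(prompt|completion)_tokens_total[{]/ && match($0, /team="[^"]*"/) {
      # Skip the 6 characters of team=" and drop the closing quote
      team = substr($0, RSTART + 6, RLENGTH - 7)
      total[team] += $NF
    }
    END { for (t in total) printf "%s %d\n", t, total[t] }
  ' | sort
}

# Sample values copied from the metrics output shown above.
tokens_by_team <<'EOF'
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo",team="dev"} 21.0
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo",team="ops"} 21.0
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo",team="dev"} 18.0
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo",team="ops"} 30.0
EOF
```

With the sample values, this prints one line per team: `dev 39` and `ops 51`. Counters without a `team` label are skipped, so requests made before the label was configured do not appear in the report.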

Conclusion

Implementing consumption reporting with Gloo AI Gateway is an effective way to manage and monitor token usage across your organization.

By using default and custom metrics, your business can track LLM consumption by model, provider, and even specific teams, which helps you use resources efficiently and prevent cost overruns. With JWT authentication and custom labels, teams can gain deeper insights into token usage, allowing for better scalability, cost management, and performance analysis. This approach enhances system observability and helps your organization optimize its AI operations and maintain control over usage patterns.

We encourage you to explore Solo.io’s documentation for more detailed instructions and additional resources. Happy coding, and we hope this helps you get started with consumption reporting!
