Monitor LLM usage with Gloo AI Gateway Consumption Reporting

Today, we'll discuss consumption reporting, a technique that allows you to extend the power of the built-in monitoring tools in Gloo AI Gateway. By implementing consumption reporting, you can track the details of LLM calls within your organization. Token usage can be monitored by AI provider and model, along with any custom labels relevant to your business.

What is consumption reporting?

Observability helps you understand how your system is performing, identify issues, and troubleshoot problems. Gloo AI Gateway provides a rich set of observability features that help you monitor and analyze the performance of your AI Gateway and the LLM providers that it interacts with.

While observability features like access logs are great for understanding individual requests, metrics are better for understanding the overall performance of your system. Gloo AI Gateway provides a set of default metrics for this purpose. In addition, you can add custom labels to these metrics to build a system of consumption reporting, which adds context to the requests, such as the number of calls that each team in your organization makes.

Why should you monitor token consumption?

There are several reasons why consumption reporting is valuable:

  • Cost management: Many LLM services, like OpenAI’s GPT models, charge based on the total number of tokens processed. Keeping an eye on token consumption helps you avoid unexpected costs, especially if you're running large queries or multiple agentic workflows.
  • Token limits: Many models have token limits, meaning there's a maximum number of tokens that can be processed in a single request (including both the prompt and the generated output). Monitoring helps you ensure your inputs don't exceed these limits, avoiding errors, truncated responses, or the need for splitting data into multiple requests.
  • Scalability: If you're building a system that scales with user interaction (e.g., chatbots, content generation, or data analysis), monitoring token usage helps you predict the system's overall resource needs. It also helps identify whether you're using the model efficiently, especially if you're handling thousands or millions of requests.
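
To make the cost point concrete, here is a minimal sketch of the arithmetic, assuming your provider bills separately per 1,000 prompt tokens and per 1,000 completion tokens. The estimate_cost function is hypothetical, and the prices are arguments that you supply yourself; they are placeholders, not real provider rates.

```shell
# Hypothetical helper: estimate spend from token counts.
#   $1 = prompt tokens            $2 = completion tokens
#   $3 = price per 1,000 prompt tokens (your provider's rate)
#   $4 = price per 1,000 completion tokens (your provider's rate)
estimate_cost() {
  # awk handles the floating-point math that plain shell arithmetic cannot.
  awk -v p="$1" -v c="$2" -v pp="$3" -v cp="$4" \
    'BEGIN { printf "%.6f\n", (p / 1000) * pp + (c / 1000) * cp }'
}

# Usage with placeholder prices:
#   estimate_cost 1000 2000 0.5 1.5
```

The token counts to plug in are the ones that the gateway metrics report later in this post.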

Set up consumption reporting with Solo.io’s Gloo AI Gateway

Now that you understand the benefits of consumption reporting, let’s walk through how to implement it using Solo.io’s Gloo AI Gateway.

Prerequisites

Before getting started, make sure you have the following:

Set up the Gloo AI Gateway

If you haven’t already, start by setting up your Gloo AI Gateway, and authenticating the gateway with your AI provider. To get started, follow these Gloo AI Gateway docs:

Default metrics

Let’s start by taking a look at the default metrics that the system outputs, based on some simple requests to the AI provider.

First, get the external address for your AI gateway.

export INGRESS_GW_ADDRESS=$(kubectl get svc -n gloo-system gloo-proxy-ai-gateway -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS

For this demo, let's try asking our AI provider to write a poem.

curl -v "${INGRESS_GW_ADDRESS}:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | jq

You'll likely get a response similar to this one:

{
  "id": "chatcmpl-AEHdIbIIY5fRbeMwWlg30g086vPGp",
  "object": "chat.completion",
  "created": 1727967736,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept unique,\nLies recursion, magic technique.\nA function that calls itself, a loop profound,\nUnraveling mysteries, in cycles bound.\n\nThrough countless calls, it travels deep,\nInto the realms of logic, where secrets keep.\nLike a mirror reflecting its own reflection,\nRecursion dives into its own inception.\n\nBase cases like roots in soil below,\nPreventing infinite loops that can grow,\nBreaking the chain of repetition's snare,\nGuiding the function with utmost care.\n\nA recursive dance, elegant and precise,\nSolving problems with coding advice.\nInfinite possibilities, layers untold,\nRecursion, a story waiting to unfold.",
        "refusal": null
      },
      ...

Now we can check the default metrics that were collected for this request and response. In another tab in your terminal, port-forward the ai-gateway container of the gateway proxy.

kubectl port-forward -n gloo-system deploy/gloo-proxy-ai-gateway 9092

In the previous tab, run the following command to view the metrics.

curl localhost:9092/metrics

In the output, search for these metrics:

  • ai_completion_tokens_total
  • ai_prompt_tokens_total

These metrics total the number of tokens used in the prompt and completion for the openai model gpt-3.5-turbo.

# HELP ai_completion_tokens_total Completion tokens
# TYPE ai_completion_tokens_total counter
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo"} 539.0
...
# HELP ai_prompt_tokens_total Prompt tokens
# TYPE ai_prompt_tokens_total counter
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo"} 204.0

These metrics tell you how many tokens have been processed for all LLM API requests that have been routed through your AI Gateway. Additionally, these metrics can help you identify which providers and which models are being called to process requests.
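
The full metrics output is long, so it can help to filter it down to the two token counters. The following is a minimal sketch; the token_metrics function name is only illustrative, and the usage line assumes that the port-forward from the previous step is still running.

```shell
# Hypothetical helper: keep only the token counter samples from
# Prometheus-format text on stdin, dropping the # HELP and # TYPE lines.
token_metrics() {
  grep -E '^ai_(prompt|completion)_tokens_total'
}

# Usage, assuming the port-forward on 9092 is still active:
#   curl -s localhost:9092/metrics | token_metrics
```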

Custom metrics

In the previous step, you reviewed metrics for total token usage. These default metrics can help you ensure your organization doesn’t exceed limits that you want to implement across all teams and organizations.

But what if you need to review more specific usage? Default metrics are useful for gauging LLM usage over time, but don’t help you understand usage by each team. You can add that context by creating custom labels.

In this example, you track each team's LLM provider usage by extracting claims from the JWTs of two users, Alice and Bob.

Authenticate users with JWTs

Start by creating a VirtualHostOption resource to define an inline JWT provider. The JWT provider is used to validate the JWTs that are sent as part of the requests to the Gloo AI Gateway. In this example, the JWT provider validates the JWT by using the public key that you add to the VirtualHostOption resource.

kubectl apply -f- <<EOF
apiVersion: gateway.solo.io/v1
kind: VirtualHostOption
metadata:
  name: jwt-provider
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: ai-gateway
  options:
    jwt:
      providers:
        selfminted:
          issuer: solo.io
          jwks:
            local:
              key: |
                -----BEGIN PUBLIC KEY-----
                MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAskFAGESgB22iOsGk/UgX
                BXTmMtd8R0vphvZ4RkXySOIra/vsg1UKay6aESBoZzeLX3MbBp5laQenjaYJ3U8P
                QLCcellbaiyUuE6+obPQVIa9GEJl37GQmZIMQj4y68KHZ4m2WbQVlZVIw/Uw52cw
                eGtitLMztiTnsve0xtgdUzV0TaynaQrRW7REF+PtLWitnvp9evweOrzHhQiPLcdm
                fxfxCbEJHa0LRyyYatCZETOeZgkOHlYSU0ziyMhHBqpDH1vzXrM573MQ5MtrKkWR
                T4ZQKuEe0Acyd2GhRg9ZAxNqs/gbb8bukDPXv4JnFLtWZ/7EooKbUC/QBKhQYAsK
                bQIDAQAB
                -----END PUBLIC KEY-----
EOF

Next, save JWTs for the example users Alice and Bob in environment variables. Each JWT includes the name of the team that the user works on.

  • Alice works on the dev team:
export ALICE_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyAiaXNzIjogInNvbG8uaW8iLCAib3JnIjogInNvbG8uaW8iLCAic3ViIjogImFsaWNlIiwgInRlYW0iOiAiZGV2IiwgImxsbXMiOiB7ICJvcGVuYWkiOiBbICJncHQtMy41LXR1cmJvIiBdIH0gfQ.I7whTti0aDKxlILc5uLK9oo6TljGS6JUrjPVd6z1PxzucUa_cnuKkY0qj_wrkzyVN5djy4t2ggE1uBO8Llpwi-Ygru9hM84-1m53aO07JYFya1VTDsI25tCRG8rYhShDdAP5L935SIARta2QtHhrVcd1Ae7yfTDZ8G1DXLtjR2QelszCd2R8PioCQmqJ8PeKg4sURhu05GlBCZoXES9-rtPVbe6j3YLBTodJAvLHhyy3LgV_QbN7IiZ5qEywdKHoEF4D4aCUf_LqPp4NoqHXnGT4jLzWJEtZXHQ4sgRy_5T93NOLzWLdIjgMjGO_F0aVLwBzU-phykOVfcBPaMvetg
  • Bob works on the ops team:
export BOB_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyAiaXNzIjogInNvbG8uaW8iLCAib3JnIjogInNvbG8uaW8iLCAic3ViIjogImJvYiIsICJ0ZWFtIjogIm9wcyIsICJsbG1zIjogeyAibWlzdHJhbGFpIjogWyAibWlzdHJhbC1sYXJnZS1sYXRlc3QiIF0gfSB9.p7J2UFwnUJ6C7eXsFCSKb5b7ecWZ75JO4TUJHafjLv8jJ7GzKfJVk7ney19PYUrWrO4ntwnnK5_sY7yaLUBCJ3fv9pcoKyRtJTw1VMMTQsKkWFgvy-jEwc9M-D5lrUfR1HXGEUm6NBaj_Ja78XScPZb_-APPqMIvzDZU04vd6hna3UMc4DZE0wcnTjOqoND0GllHLupYTfgX0v9_AYJiKRAcJvol1W14dI7szpY5GFZtPqq0kl1g0sJPg-HQKwf7Cfvr_JLjkepNJ6A1lsrG8QbuUvMUAdaHzwLvF3L_G6VRjEte6okZpaq0g2urWpZgdNmPVN71Q_0WhyrJTr6SyQ
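
If you want to confirm which claims a token carries, you can decode its payload locally. The sketch below is one way to do that; the jwt_payload function is hypothetical, it does not verify the signature, and it assumes a base64 command that accepts the -d flag (as in GNU coreutils; other platforms may differ).

```shell
# Hypothetical helper: print the decoded payload (claims) of a JWT.
# This only decodes the token; it does NOT validate the signature.
jwt_payload() {
  local payload
  # The payload is the second dot-separated segment, encoded as base64url.
  # Map the base64url alphabet back to standard base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # JWTs strip the trailing '=' padding, so restore it before decoding.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d
}

# Usage:
#   jwt_payload "$ALICE_TOKEN"
#   jwt_payload "$BOB_TOKEN"
```

For Alice's token, the decoded JSON should include a team claim set to dev, which is the value that the custom metric label picks up later in this post.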

Make sure that the users can still access the AI API now that it is protected with a JWT provider. For example, send another poem request to the LLM, but this time include Alice's JWT in the Authorization header.

curl -v "${INGRESS_GW_ADDRESS}:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | jq

Because Alice's JWT is successfully validated, access to the AI API is granted. You should see a similar response to the one you received earlier.

{
  "id": "chatcmpl-AEHdIbIIY5fRbeMwWlg30g086vPGp",
  "object": "chat.completion",
  "created": 1727967736,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept unique,\nLies recursion, magic technique.\nA function that calls itself, a loop profound,\nUnraveling mysteries, in cycles bound.\n\nThrough countless calls, it travels deep,\nInto the realms of logic, where secrets keep.\nLike a mirror reflecting its own reflection,\nRecursion dives into its own inception.\n\nBase cases like roots in soil below,\nPreventing infinite loops that can grow,\nBreaking the chain of repetition's snare,\nGuiding the function with utmost care.\n\nA recursive dance, elegant and precise,\nSolving problems with coding advice.\nInfinite possibilities, layers untold,\nRecursion, a story waiting to unfold.",
        "refusal": null
      },
      ...

Track token usage based on team

Now, you can create custom metric labels that help you track token usage by team.

To add custom labels to the metrics, update the GatewayParameters resource. In the stats.customLabels section, add a list of labels that contain the name of the label and the dynamic metadata field to get the label from.

kubectl apply -f- <<EOF
apiVersion: gateway.gloo.solo.io/v1alpha1
kind: GatewayParameters
metadata:
  name: gloo-gateway-override
  namespace: gloo-system
spec:
  kube:
    aiExtension:
      enabled: true
      stats:
        customLabels:
          - name: "team"
            metadataKey: "principal:team"
EOF

In this resource, the team label is sourced from the team claim of the JWT. The metadata namespace defaults to the namespace where you defined the JWT provider, but you can specify a different namespace if you have a different source of metadata. When you apply this resource, the gateway proxy restarts to pick up the new stats configuration.

Let’s try sending some requests again to generate metrics data. Send one request with Alice’s token, and one with Bob’s token.

curl "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
"model": "gpt-3.5-turbo",
"messages": [
  {
    "role": "user",
    "content": "Please explain the movie Dr. Strangelove in 1 sentence."
  }
]
}'
curl "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $BOB_TOKEN" -H content-type:application/json -d '{
"model": "gpt-3.5-turbo",
"messages": [
  {
    "role": "user",
    "content": "Please explain the movie Dr. Strangelove in 1 sentence."
  }
]
}'

In your port-forwarding tab, run the port-forward command again for the ai-gateway container of the gateway proxy.

kubectl port-forward -n gloo-system deploy/gloo-proxy-ai-gateway 9092

In the previous tab, check the metrics again.

curl localhost:9092/metrics

In the output, search for the same metrics that total the number of tokens used in the prompt and completion for the openai model gpt-3.5-turbo:

  • ai_completion_tokens_total
  • ai_prompt_tokens_total

This time, the metrics are broken out on a team-by-team basis. This way, you can easily check how many tokens Alice’s dev team is using, and how many Bob’s ops team is using!

# HELP ai_completion_tokens_total Completion tokens
# TYPE ai_completion_tokens_total counter
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo",team="dev"} 18.0
ai_completion_tokens_total{llm="openai",model="gpt-3.5-turbo",team="ops"} 30.0
...
# HELP ai_prompt_tokens_total Prompt tokens
# TYPE ai_prompt_tokens_total counter
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo",team="dev"} 21.0
ai_prompt_tokens_total{llm="openai",model="gpt-3.5-turbo",team="ops"} 21.0
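
To turn the per-team samples into a single number per team, you can add up the prompt and completion counters. This is a minimal sketch; the tokens_by_team function is hypothetical, and it assumes that each sample line ends with its value, as in the output shown in this post.

```shell
# Hypothetical helper: sum prompt and completion tokens per team from
# Prometheus-format text on stdin. Samples without a team label are
# grouped under "unlabeled".
tokens_by_team() {
  awk '
    /^ai_(prompt|completion)_tokens_total[{]/ {
      team = "unlabeled"
      # Pull the value out of the team="..." label, if present.
      if (match($0, /team="[^"]*"/)) {
        team = substr($0, RSTART + 6, RLENGTH - 7)
      }
      # Assumes the sample value is the last field on the line.
      total[team] += $NF
    }
    END { for (t in total) printf "%s %d\n", t, total[t] }
  ' | sort
}

# Usage, assuming the port-forward on 9092 is still active:
#   curl -s localhost:9092/metrics | tokens_by_team
```

With the sample values from this post, the dev team totals 39 tokens (21 prompt plus 18 completion) and the ops team totals 51 (21 plus 30).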

Conclusion

Implementing consumption reporting with Gloo AI Gateway is an effective way to monitor and manage token usage across your organization.

By utilizing default and custom metrics, your business can track LLM consumption by model, provider, and even specific teams, ensuring efficient resource usage and preventing cost overruns. With tools like JWT authentication and custom labels, teams can gain deeper insights into token usage, allowing for better scalability, cost management, and performance analysis. This approach not only enhances system observability but also empowers your organization to optimize your AI operations and maintain control over usage patterns.

We encourage you to explore Solo.io’s documentation for more detailed instructions and additional resources. Happy coding, and we hope this helps you get started with consumption reporting!
