Protect your AI-powered apps with tiered rate limiting in Gloo AI Gateway

As generative AI models increasingly power your apps, managing API traffic becomes more than a technical concern: it’s a business-critical necessity. LLMs like OpenAI’s GPT-4o and Anthropic’s Claude are powerful, but they’re also compute-intensive and costly to run at scale.

That’s where rate limiting comes in.

About rate limiting

Typically, rate limiting controls how many requests a client can send to your API in a given time window, such as 100 requests per minute. Request limits are often tied to tiered plans, such as freemium vs. paid users. To learn more about general rate limiting in Gloo Gateway, refer to the docs.

Benefits of rate limiting AI requests

LLMs have certain characteristics that make rate limiting AI requests even more important than regular API requests.

  • Cost Per Request: AI model queries can be orders of magnitude more expensive than typical API calls.
  • Variable Load: Request complexity varies based on the tokens and models used.
  • Multi-Tenant Use: APIs often serve many users and organizations with different plans. Repetitive requests can not only drain resources but also result in confusion due to the unpredictability of LLM outputs.
  • Abuse Prevention: Without limits, malicious actors could drain resources or rack up cloud compute costs.

Token-based rate limits

In AI contexts, request-based rate limiting is not always a fair way to control usage. AI providers charge for usage based on the number of input and output tokens. Input tokens vary with the number and length of the user and system prompts included in the request. Output tokens vary with the length and complexity of the response from the model. As a result, token usage can vary greatly from one request to the next, so rate limiting strictly by the number of requests does not guarantee the control and experience that you want to give users of your apps.

Gloo Gateway automatically adjusts the units from requests to tokens when you apply a rate limit to a route that is served by an AI Gateway.
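To see why counting requests and counting tokens can diverge, consider the following minimal Python sketch. It compares two hypothetical users who send the same number of requests but consume very different numbers of tokens. The Usage class and the light user's numbers are assumptions for illustration; the 39 prompt tokens and 100 completion tokens match the example response later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token usage for one request (hypothetical helper, not a Gloo API)."""
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


# Both users send ten requests, so a request-based limit treats them the same.
light_user = [Usage(prompt_tokens=8, completion_tokens=12)] * 10    # assumed values
heavy_user = [Usage(prompt_tokens=39, completion_tokens=100)] * 10  # from this post's example response

for name, requests in (("light", light_user), ("heavy", heavy_user)):
    tokens = sum(u.total_tokens for u in requests)
    print(f"{name}: {len(requests)} requests, {tokens} tokens")
```

Both users look identical to a limit of 10 requests, but the heavy user consumes 1,390 tokens compared to 200, which is about seven times as much.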

Tier-based rate limits

To take your rate limit a step further, you can set up a tiered approach to rate limiting. Rather than enforcing a single flat limit, tiered rate limiting applies multiple overlapping limits across different time intervals — typically minute, hour, and day.

With this approach, a user can’t spike traffic suddenly, use up too much during certain hours of intense activity, or exceed a daily quota. Such tiers protect your AI workloads at every timescale, ensuring stable performance and fair sharing of resources.

Table 1

Tier     Limit per user       Purpose
Minute   100 tokens/minute    Protect against short-term bursts of traffic.
Hour     1,000 tokens/hour    Control sustained high traffic during peak hours of the day.
Day      10,000 tokens/day    Enforce overall consumption limits, such as those set by a user’s pricing plan.
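To make the interaction between the tiers in Table 1 concrete, here is a small Python simulation. It is a sketch under stated assumptions, not how Gloo implements rate limiting: it uses simple fixed windows, admits a request while every tier is still below its limit, and charges tokens only after the response, because the output token count is not known in advance. The TieredTokenLimiter class and the cost of 60 tokens per request are hypothetical.

```python
from dataclasses import dataclass, field

# Tier name -> (window length in seconds, tokens allowed per window), from Table 1.
TIERS = {
    "minute": (60, 100),
    "hour": (3_600, 1_000),
    "day": (86_400, 10_000),
}


@dataclass
class TieredTokenLimiter:
    """Illustrative fixed-window limiter; not how Gloo implements rate limiting."""
    # Maps (user, tier) to (window index, tokens used in that window).
    usage: dict = field(default_factory=dict)

    def _used(self, user: str, tier: str, now: float) -> int:
        window, _ = TIERS[tier]
        index, used = self.usage.get((user, tier), (None, 0))
        # Usage from an earlier window no longer counts.
        return used if index == int(now // window) else 0

    def allow(self, user: str, now: float) -> bool:
        """Admit a request only if every tier still has budget left."""
        return all(self._used(user, tier, now) < limit
                   for tier, (_, limit) in TIERS.items())

    def record(self, user: str, tokens: int, now: float) -> None:
        """Charge the tokens that the model reported against every tier."""
        for tier, (window, _) in TIERS.items():
            used = self._used(user, tier, now)
            self.usage[(user, tier)] = (int(now // window), used + tokens)


limiter = TieredTokenLimiter()
results = []
for second in (0, 10, 20, 70):  # request times in seconds
    if limiter.allow("alice", second):
        limiter.record("alice", 60, second)  # assume each request costs 60 tokens
        results.append((second, "allowed"))
    else:
        results.append((second, "denied"))
print(results)
```

In this run, the request at second 20 is denied because the minute tier is exhausted, and the request at second 70 is allowed because a new minute window has started while the hour and day tiers still have budget.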

Example scenario

Let’s take a look at how to set up a tiered rate limit with Gloo AI Gateway, as shown in the following figure.

Figure 1: Tiered rate limiting with Gloo AI Gateway.

Before you begin

Make sure that you have an enterprise Gloo Gateway environment set up with AI Gateway and your LLM provider. If you don’t, check out these docs:

  1. Get started with Gloo Gateway
  2. Set up AI Gateway
  3. Authenticate to the LLM provider

Step 1: Authenticate users with JWTs

To set up user-based rate limiting, you first have to identify users somehow. With Gloo AI Gateway, you can authenticate users based on JSON Web Tokens (JWTs) that they present in their requests.

Let’s use a simple, self-signed JWT as an example. A sample user, Alice, has a self-signed JWT with the following claims. She works on the dev team and has access to the gpt-3.5-turbo LLM.

{
  "iss": "solo.io",
  "org": "solo.io",
  "sub": "alice",
  "team": "dev",
  "llms": {
    "openai": [
      "gpt-3.5-turbo"
    ]
  }
}


Go ahead and save her JWT as an environment variable that you can use in requests later.

export ALICE_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyAiaXNzIjogInNvbG8uaW8iLCAib3JnIjogInNvbG8uaW8iLCAic3ViIjogImFsaWNlIiwgInRlYW0iOiAiZGV2IiwgImxsbXMiOiB7ICJvcGVuYWkiOiBbICJncHQtMy41LXR1cmJvIiBdIH0gfQ.I7whTti0aDKxlILc5uLK9oo6TljGS6JUrjPVd6z1PxzucUa_cnuKkY0qj_wrkzyVN5djy4t2ggE1uBO8Llpwi-Ygru9hM84-1m53aO07JYFya1VTDsI25tCRG8rYhShDdAP5L935SIARta2QtHhrVcd1Ae7yfTDZ8G1DXLtjR2QelszCd2R8PioCQmqJ8PeKg4sURhu05GlBCZoXES9-rtPVbe6j3YLBTodJAvLHhyy3LgV_QbN7IiZ5qEywdKHoEF4D4aCUf_LqPp4NoqHXnGT4jLzWJEtZXHQ4sgRy_5T93NOLzWLdIjgMjGO_F0aVLwBzU-phykOVfcBPaMvetg
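To check which claims a token carries before you send it, you can decode its payload locally. The following Python sketch decodes the middle segment of a JWT without verifying the signature, so treat it as a debugging aid only. The jwt_claims helper is a hypothetical name, and the script assumes that the ALICE_TOKEN environment variable from the previous command is set.

```python
import base64
import json
import os


def jwt_claims(token: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    # JWT segments are base64url-encoded with the padding stripped; restore it.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


token = os.environ.get("ALICE_TOKEN")
if token:
    claims = jwt_claims(token)
    print(json.dumps(claims, indent=2))
else:
    print("ALICE_TOKEN is not set")
```

The sub claim in the output is the value that the rate limits later in this post use to identify the user.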


To require a valid JWT in every request, apply a VirtualHostOption to the AI Gateway that you created as part of the prerequisites. The following example includes a public key that successfully validates Alice’s JWT.

kubectl apply -f- <<EOF
apiVersion: gateway.solo.io/v1
kind: VirtualHostOption
metadata:
  name: jwt-provider
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: ai-gateway
  options:
    jwt:
      providers:
        selfminted:
          issuer: solo.io
          jwks:
            local:
              key: |
                -----BEGIN PUBLIC KEY-----
                MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAskFAGESgB22iOsGk/UgX
                BXTmMtd8R0vphvZ4RkXySOIra/vsg1UKay6aESBoZzeLX3MbBp5laQenjaYJ3U8P
                QLCcellbaiyUuE6+obPQVIa9GEJl37GQmZIMQj4y68KHZ4m2WbQVlZVIw/Uw52cw
                eGtitLMztiTnsve0xtgdUzV0TaynaQrRW7REF+PtLWitnvp9evweOrzHhQiPLcdm
                fxfxCbEJHa0LRyyYatCZETOeZgkOHlYSU0ziyMhHBqpDH1vzXrM573MQ5MtrKkWR
                T4ZQKuEe0Acyd2GhRg9ZAxNqs/gbb8bukDPXv4JnFLtWZ/7EooKbUC/QBKhQYAsK
                bQIDAQAB
                -----END PUBLIC KEY-----
EOF


Now, requests to the AI Gateway must include the JWT to succeed.

Unsuccessful request without a JWT: 

curl -v "${INGRESS_GW_ADDRESS}:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | jq

Example response that returns a 401 Unauthorized message:

* Connected to XX.XXX.XX.XXX (XX.XXX.XX.XXX) port 8080 (#0)
> POST /openai HTTP/1.1
> Host: XX.XXX.XX.XXX:8080
> User-Agent: curl/7.88.1
> Accept: */*
> content-type:application/json
> Content-Length: 330
>
} [330 bytes data]
< HTTP/1.1 401 Unauthorized
< www-authenticate: Bearer realm="http://XX.XXX.XX.XXX:8080/openai"
< content-type: text/plain
< date: Thu, 03 Oct 2024 14:59:51 GMT
< server: envoy
< transfer-encoding: chunked
<
{ [24 bytes data]
100   344    0    14  100   330     81   1925 --:--:-- --:--:-- --:--:--  2072
* Connection #0 to host XX.XXX.XX.XXX left intact


Successful request with Alice’s JWT:

curl "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
 "model": "gpt-3.5-turbo",
 "messages": [
   {
     "role": "system",
     "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
   },
   {
     "role": "user",
     "content": "Compose a poem that explains the concept of recursion in programming."
   }
 ]
}' | jq


Example response:

{
"id": "chatcmpl-AEHdIbIIY5fRbeMwWlg30g086vPGp",
"object": "chat.completion",
"created": 1727967736,
"model": "gpt-3.5-turbo-0125",
"choices": [
  {
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "In the realm of code, a concept unique,\nLies recursion, magic technique.\nA function that calls itself, a loop profound,\nUnraveling mysteries, in cycles bound.\n\nThrough countless calls, it travels deep,\nInto the realms of logic, where secrets keep.\nLike a mirror reflecting its own reflection,\nRecursion dives into its own inception.\n\nBase cases like roots in soil below,\nPreventing infinite loops that can grow,\nBreaking the chain of repetition's snare,\nGuiding the function with utmost care.\n\nA recursive dance, elegant and precise,\nSolving problems with coding advice.\nInfinite possibilities, layers untold,\nRecursion, a story waiting to unfold.",
      "refusal": null
    },
    ...


Good job! You have set up user authentication for requests to the AI Gateway. Now, you’re ready to apply rate limits per user.

Step 2: Define tiered RateLimitConfigs

You want to enforce rate limits based not only on AI token usage, but also on the particular user who sends the request, so that each user has a separate token budget. To do that, extract the “sub” claim from the JWT in the request. The following example RateLimitConfig reads the claim from the dynamic metadata of the request and uses it as the rate limit descriptor. Because the route is served by an AI Gateway, the requestsPerUnit value is counted in tokens, so the config sets a limit of 100 tokens per minute for each user.

kubectl apply -f- <<EOF
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-counter-minute
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 100
        unit: MINUTE
    rateLimits:
    - actions:
      - metadata:
          descriptorKey: user-id
          source: DYNAMIC
          default: unknown
          metadataKey:
            key: "envoy.filters.http.jwt_authn"
            path:
            - key: principal
            - key: sub
EOF
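Conceptually, the metadata action looks up a nested value in the dynamic metadata of the request and falls back to the default when the value is missing. The following Python sketch imitates that lookup with plain dictionaries. It illustrates the idea only and is not Envoy’s or Gloo’s implementation; the descriptor_value helper and the shape of the sample metadata are assumptions based on the config above.

```python
from typing import Any


def descriptor_value(metadata: dict, namespace: str, path: list[str], default: str) -> str:
    """Walk `path` inside metadata[namespace]; return `default` if any key is missing."""
    node: Any = metadata.get(namespace, {})
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return str(node)


# Assumed shape: the verified JWT claims are stored under the "principal" key.
dynamic_metadata = {
    "envoy.filters.http.jwt_authn": {
        "principal": {"iss": "solo.io", "sub": "alice", "team": "dev"},
    }
}

path = ["principal", "sub"]
print(descriptor_value(dynamic_metadata, "envoy.filters.http.jwt_authn", path, "unknown"))
print(descriptor_value({}, "envoy.filters.http.jwt_authn", path, "unknown"))
```

The first call prints alice and the second prints unknown. Under this model, each distinct user-id value gets its own counter, and any request without a usable sub value would share the unknown counter.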


Remember that you also want to set up tiers of rate limiting, to protect against not just bursts of traffic per minute, but also heavy usage per hour and per day. To do so, apply two more RateLimitConfigs as follows.

kubectl apply -f- <<EOF
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-counter-hour
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 1000
        unit: HOUR
    rateLimits:
    - actions:
      - metadata:
          descriptorKey: user-id
          source: DYNAMIC
          default: unknown
          metadataKey:
            key: "envoy.filters.http.jwt_authn"
            path:
            - key: principal
            - key: sub
---
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-counter-day
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 10000
        unit: DAY
    rateLimits:
    - actions:
      - metadata:
          descriptorKey: user-id
          source: DYNAMIC
          default: unknown
          metadataKey:
            key: "envoy.filters.http.jwt_authn"
            path:
            - key: principal
            - key: sub
EOF

Step 3: Apply rate limiting to the route

Now that you have your three tiers of rate limits configured, you have to apply the policy to your routes. To do so, create a RouteOption such as the following. The RouteOption attaches the three RateLimitConfigs to the openai HTTPRoute that you configured as part of the prerequisites.

kubectl apply -f- <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: rlc-route-option
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    rateLimitConfigs:
      refs:
      - name: per-user-counter-minute
        namespace: gloo-system
      - name: per-user-counter-hour
        namespace: gloo-system
      - name: per-user-counter-day
        namespace: gloo-system
EOF


If you repeat your request with Alice’s JWT, you get back a successful response!

curl "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
 "model": "gpt-3.5-turbo",
 "messages": [
   {
     "role": "system",
     "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
   },
   {
     "role": "user",
     "content": "Compose a poem that explains the concept of recursion in programming."
   }
 ]
}' | jq


Example response that includes the token usage:

{
  "id": "chatcmpl-9bLT1ofadlXEMpo53LcGjHsv3S5Ry",
  "object": "chat.completion",
  "created": 1718687683,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept so divine,\nRecursion weaves patterns, like nature's design.\nA function that calls itself, with purpose and grace,\nIt solves problems complex, with elegance and pace.\n\nLike a mirror reflecting its own reflection,\nRecursion repeats with boundless affection.\nEach iteration holds a story untold,\nUnraveling mysteries, a journey unfold.\n\nInfinite loops, a dangerous abyss,\nRecursion beckons with a siren's sweet kiss.\nBase case in"
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 39,
    "completion_tokens": 100,
    "total_tokens": 139
  },
  "system_fingerprint": null
}


If you keep repeating the request so that Alice’s usage exceeds the limit of 100 tokens per minute, 1,000 tokens per hour, or 10,000 tokens per day, the request is denied with a 429 Too Many Requests error.
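As a rough worked example, you can estimate how many requests like the one above fit into each tier. The following Python sketch assumes that every request costs 139 tokens, as in the example response, and that a request is admitted while a tier’s counter is still below its limit, with tokens charged after the response. That accounting model is an assumption for illustration; check the Gloo AI Gateway docs for the exact behavior.

```python
import math

TOKENS_PER_REQUEST = 139  # total_tokens from the example response
LIMITS = {"minute": 100, "hour": 1_000, "day": 10_000}  # tokens per window

# Requests admitted before a tier's counter reaches its limit,
# assuming tokens are charged after each response.
admitted = {tier: math.ceil(limit / TOKENS_PER_REQUEST) for tier, limit in LIMITS.items()}
print(admitted)
```

Under these assumptions, the sketch prints 1 request for the minute tier, 8 for the hour tier, and 72 for the day tier, so the minute tier is the first to trigger: a second request within the same minute would be denied.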

Conclusion 

Good work! You set up rate limits in several ways that are critical to protecting your AI traffic: by user, by token usage, and in tiers.

For more information, get a free demo or check out the Gloo AI Gateway docs, and let us know how it goes.
