As generative AI models increasingly power your apps, managing API traffic becomes more than a technical concern: it's a business-critical necessity. LLMs such as OpenAI's GPT-4o and Anthropic's Claude are powerful, but they're also compute-intensive and costly to run at scale.
That’s where rate limiting comes in.
About rate limiting
Typically, rate limiting controls how many requests a client can send to your API in a given time window, such as 100 requests per minute. Request limits are often tied to tiered plans, such as freemium vs. paid users. To learn more about general rate limiting in Gloo Gateway, refer to the docs.
Benefits of rate limiting AI requests
LLMs have certain characteristics that make rate limiting even more important for AI requests than for regular API requests.
- Cost Per Request: AI model queries can be orders of magnitude more expensive than typical API calls.
- Variable Load: Request complexity varies based on the tokens and models used.
- Multi-Tenant Use: APIs often serve many users and organizations with different plans. Repetitive requests not only drain shared resources, but because LLM outputs are nondeterministic, they don't even return consistent answers.
- Abuse Prevention: Without limits, malicious actors could drain resources or rack up cloud compute costs.
Token-based rate limits
In AI contexts, request-based rate limiting is not always a fair way to control usage. AI providers charge based on the number of input and output tokens. Input tokens vary with the length of the user and system prompts included in the request; output tokens vary with the length and complexity of the model's response. In other words, token usage can differ greatly from one request to the next, so rate limiting strictly by request count does not guarantee the control and experience that you want to give users of your apps.
Gloo Gateway automatically adjusts the units from requests to tokens when you apply a rate limit to a route that is served by an AI Gateway.
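To make this concrete, here's a minimal Python sketch of a token-based counter over a fixed one-minute window. This is an illustration of the idea only, not Gloo's implementation, and the TokenWindowLimiter name is invented for the example: each request is charged for the tokens it consumed, so one large completion can use up a budget that would cover many small requests.

import time

class TokenWindowLimiter:
    """Fixed one-minute window that counts tokens, not requests."""

    def __init__(self, tokens_per_minute):
        self.budget = tokens_per_minute
        self.remaining = tokens_per_minute
        self.window_start = time.monotonic()

    def allow(self, tokens_used):
        now = time.monotonic()
        if now - self.window_start >= 60:  # a new minute starts: reset the counter
            self.remaining = self.budget
            self.window_start = now
        if tokens_used > self.remaining:
            return False                   # would exceed this minute's token budget
        self.remaining -= tokens_used
        return True

limiter = TokenWindowLimiter(tokens_per_minute=100)
print(limiter.allow(39))   # True: a small request fits easily
print(limiter.allow(139))  # False: a single large completion exhausts the window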
Tier-based rate limits
To take your rate limit a step further, you can set up a tiered approach to rate limiting. Rather than enforcing a single flat limit, tiered rate limiting applies multiple overlapping limits across different time intervals — typically minute, hour, and day.
With this approach, a user can't spike traffic suddenly, consume too much capacity during hours of intense activity, or exceed a daily quota. Together, these tiers protect your AI workloads at every timescale, ensuring stable performance and fair sharing of resources.
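As a rough Python sketch of that idea (again illustrative, not Gloo's implementation, and with window resets omitted for brevity), a request is admitted only when it fits within every tier at once, and a successful request is charged against all of them:

tiers = {"minute": [100, 0], "hour": [1000, 0], "day": [10000, 0]}  # [limit, used]

def admit(tokens):
    if any(used + tokens > limit for limit, used in tiers.values()):
        return False              # one exhausted tier is enough to deny
    for counter in tiers.values():
        counter[1] += tokens      # charge every tier on success
    return True

print(admit(80))  # True: fits the minute, hour, and day tiers
print(admit(40))  # False: 120 tokens would exceed the 100-token minute tier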
Example scenario
Let’s take a look at how to set up a tier-based rate limit with Gloo AI Gateway.

Before you begin
Make sure that you have an enterprise Gloo Gateway environment set up with AI Gateway and your LLM provider. If you don’t, check out the Gloo AI Gateway setup docs first.
Step 1: Authenticate users with JWTs
To set up user-based rate limiting, you first have to identify users somehow. With Gloo AI Gateway, you can authenticate users based on JSON Web Tokens (JWTs) that they present in their requests.
Let’s use a simple, self-signed JWT as an example. A sample user, Alice, has a self-signed JWT with the following claims. She works on the dev team and has access to the gpt-3.5-turbo LLM.
{
  "iss": "solo.io",
  "org": "solo.io",
  "sub": "alice",
  "team": "dev",
  "llms": {
    "openai": [
      "gpt-3.5-turbo"
    ]
  }
}
Go ahead and save her JWT as an environment variable that you can use in requests later.
export ALICE_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyAiaXNzIjogInNvbG8uaW8iLCAib3JnIjogInNvbG8uaW8iLCAic3ViIjogImFsaWNlIiwgInRlYW0iOiAiZGV2IiwgImxsbXMiOiB7ICJvcGVuYWkiOiBbICJncHQtMy41LXR1cmJvIiBdIH0gfQ.I7whTti0aDKxlILc5uLK9oo6TljGS6JUrjPVd6z1PxzucUa_cnuKkY0qj_wrkzyVN5djy4t2ggE1uBO8Llpwi-Ygru9hM84-1m53aO07JYFya1VTDsI25tCRG8rYhShDdAP5L935SIARta2QtHhrVcd1Ae7yfTDZ8G1DXLtjR2QelszCd2R8PioCQmqJ8PeKg4sURhu05GlBCZoXES9-rtPVbe6j3YLBTodJAvLHhyy3LgV_QbN7IiZ5qEywdKHoEF4D4aCUf_LqPp4NoqHXnGT4jLzWJEtZXHQ4sgRy_5T93NOLzWLdIjgMjGO_F0aVLwBzU-phykOVfcBPaMvetg
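If you want to double-check the claims locally, you can decode the payload segment of the JWT. The following Python snippet is just a quick sanity check; it does not verify the signature, which the gateway handles against the public key in the next step.

import base64, json, os

token = os.environ["ALICE_TOKEN"]
payload = token.split(".")[1]         # the claims are the second dot-separated segment
payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
print(json.dumps(json.loads(base64.urlsafe_b64decode(payload)), indent=2))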
To enforce that users include a valid JWT in their requests, apply a VirtualHostOption to the AI Gateway that you set up before you began. The following example includes a public key that successfully validates Alice’s JWT.
kubectl apply -f- <<EOF
apiVersion: gateway.solo.io/v1
kind: VirtualHostOption
metadata:
  name: jwt-provider
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: ai-gateway
  options:
    jwt:
      providers:
        selfminted:
          issuer: solo.io
          jwks:
            local:
              key: |
                -----BEGIN PUBLIC KEY-----
                MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAskFAGESgB22iOsGk/UgX
                BXTmMtd8R0vphvZ4RkXySOIra/vsg1UKay6aESBoZzeLX3MbBp5laQenjaYJ3U8P
                QLCcellbaiyUuE6+obPQVIa9GEJl37GQmZIMQj4y68KHZ4m2WbQVlZVIw/Uw52cw
                eGtitLMztiTnsve0xtgdUzV0TaynaQrRW7REF+PtLWitnvp9evweOrzHhQiPLcdm
                fxfxCbEJHa0LRyyYatCZETOeZgkOHlYSU0ziyMhHBqpDH1vzXrM573MQ5MtrKkWR
                T4ZQKuEe0Acyd2GhRg9ZAxNqs/gbb8bukDPXv4JnFLtWZ/7EooKbUC/QBKhQYAsK
                bQIDAQAB
                -----END PUBLIC KEY-----
EOF
Now, requests to the AI Gateway must include the JWT to succeed.
Unsuccessful request without a JWT:
curl -v "${INGRESS_GW_ADDRESS}:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | jq
Example response that returns a 401 Unauthorized message:
* Connected to XX.XXX.XX.XXX (XX.XXX.XX.XXX) port 8080 (#0)
> POST /openai HTTP/1.1
> Host: XX.XXX.XX.XXX:8080
> User-Agent: curl/7.88.1
> Accept: */*
> content-type:application/json
> Content-Length: 330
>
} [330 bytes data]
< HTTP/1.1 401 Unauthorized
< www-authenticate: Bearer realm="http://XX.XXX.XX.XXX:8080/openai"
< content-type: text/plain
< date: Thu, 03 Oct 2024 14:59:51 GMT
< server: envoy
< transfer-encoding: chunked
<
{ [24 bytes data]
100 344 0 14 100 330 81 1925 --:--:-- --:--:-- --:--:-- 2072
* Connection #0 to host XX.XXX.XX.XXX left intact
Successful request with Alice’s JWT:
curl "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | jq
Example response:
{
  "id": "chatcmpl-AEHdIbIIY5fRbeMwWlg30g086vPGp",
  "object": "chat.completion",
  "created": 1727967736,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept unique,\nLies recursion, magic technique.\nA function that calls itself, a loop profound,\nUnraveling mysteries, in cycles bound.\n\nThrough countless calls, it travels deep,\nInto the realms of logic, where secrets keep.\nLike a mirror reflecting its own reflection,\nRecursion dives into its own inception.\n\nBase cases like roots in soil below,\nPreventing infinite loops that can grow,\nBreaking the chain of repetition's snare,\nGuiding the function with utmost care.\n\nA recursive dance, elegant and precise,\nSolving problems with coding advice.\nInfinite possibilities, layers untold,\nRecursion, a story waiting to unfold.",
        "refusal": null
      },
      ...
Good job! You have set up user authentication for requests to the AI Gateway. Now, you’re ready to apply rate limits per user.
Step 2: Define tiered RateLimitConfigs
You want to enforce rate limits based not only on AI token usage, but also on the particular user who sends the request, so that token usage is tracked per user. You can do that by extracting the user’s “sub” claim from the JWT in the request. The following example RateLimitConfig uses that dynamic metadata as the rate limit descriptor and sets a limit of 100 tokens per minute.
kubectl apply -f- <<EOF
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-counter-minute
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 100
        unit: MINUTE
    rateLimits:
    - actions:
      - metadata:
          descriptorKey: user-id
          source: DYNAMIC
          default: unknown
          metadataKey:
            key: "envoy.filters.http.jwt_authn"
            path:
            - key: principal
            - key: sub
EOF
Remember that you also want to set up tiers of rate limiting, to protect against not just bursts within a minute, but also excessive usage per hour and per day. To do so, apply two more RateLimitConfigs as follows.
kubectl apply -f- <<EOF
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-counter-hour
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 1000
        unit: HOUR
    rateLimits:
    - actions:
      - metadata:
          descriptorKey: user-id
          source: DYNAMIC
          default: unknown
          metadataKey:
            key: "envoy.filters.http.jwt_authn"
            path:
            - key: principal
            - key: sub
---
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-counter-day
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 10000
        unit: DAY
    rateLimits:
    - actions:
      - metadata:
          descriptorKey: user-id
          source: DYNAMIC
          default: unknown
          metadataKey:
            key: "envoy.filters.http.jwt_authn"
            path:
            - key: principal
            - key: sub
EOF
Step 3: Apply rate limiting to the route
Now that you have your three tiers of rate limits configured, apply the policy to your routes. To do so, create a RouteOption like the following, which attaches the three RateLimitConfigs to the openai HTTPRoute that you configured before you began.
kubectl apply -f- <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: rlc-route-option
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    rateLimitConfigs:
      refs:
      - name: per-user-counter-minute
        namespace: gloo-system
      - name: per-user-counter-hour
        namespace: gloo-system
      - name: per-user-counter-day
        namespace: gloo-system
EOF
If you repeat your request with Alice’s JWT, you get back a successful response!
curl "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | jq
Example response that includes the token usage:
{
  "id": "chatcmpl-9bLT1ofadlXEMpo53LcGjHsv3S5Ry",
  "object": "chat.completion",
  "created": 1718687683,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept so divine,\nRecursion weaves patterns, like nature's design.\nA function that calls itself, with purpose and grace,\nIt solves problems complex, with elegance and pace.\n\nLike a mirror reflecting its own reflection,\nRecursion repeats with boundless affection.\nEach iteration holds a story untold,\nUnraveling mysteries, a journey unfold.\n\nInfinite loops, a dangerous abyss,\nRecursion beckons with a siren's sweet kiss.\nBase case in"
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 39,
    "completion_tokens": 100,
    "total_tokens": 139
  },
  "system_fingerprint": null
}
If you keep repeating the request until it exceeds any tier of the limit, such as 100 tokens per minute, 1,000 tokens per hour, or 10,000 tokens per day, the request is denied with a 429 Too Many Requests error.
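If you want to script that check, the following Python sketch replays the same request until a tier is exhausted. It assumes the INGRESS_GW_ADDRESS and ALICE_TOKEN environment variables from the earlier steps and plain HTTP on port 8080; keep in mind that every accepted request consumes real tokens from your LLM provider.

import json, os, urllib.request, urllib.error

url = f"http://{os.environ['INGRESS_GW_ADDRESS']}:8080/openai"
body = json.dumps({
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "Compose a poem that explains recursion."}
    ],
}).encode()

for attempt in range(1, 21):
    request = urllib.request.Request(url, data=body, headers={
        "Authorization": f"Bearer {os.environ['ALICE_TOKEN']}",
        "content-type": "application/json",
    })
    try:
        with urllib.request.urlopen(request) as response:
            print(attempt, response.status)
    except urllib.error.HTTPError as err:
        print(attempt, err.code)  # 429 once a tier's budget runs out
        if err.code == 429:
            break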
Conclusion
Good work! You set up rate limits in several ways that are critical to protecting your AI traffic: by user, by token usage, and in tiers.
For more information, get a free demo or check out the Gloo AI Gateway docs, and let us know how it goes.