Acknowledgements
Thanks to Monika Katariya, Alex Steiner, and Adam Tetelman from NVIDIA and Lin Sun from Solo.io for their reviews of this blog.
Although LLM usage presents many opportunities for enterprise organizations, it also raises concerns around scaling, such as cost, data privacy, performance, and resilience. For example, we have customers building internal chat systems for their employees that use company-internal data to solve problems more quickly, offer solutions, and improve user experience. Common challenges these customers run into when implementing these scenarios include:
- Quickly onboarding new LLM models and safely switching between them
- Enforcing guardrails and content moderation for data protection, compliance and security
- Tracking LLM usage, preventing cost overruns, establishing usage quotas and implementing model failover mechanisms
- Gaining deep observability into which LLM calls are happening, who is consuming LLM tokens, and how to debug when things go wrong or slow down
In this blog we look at two technologies designed to help solve these challenges: NVIDIA NIM microservices for deploying LLMs on Kubernetes, and Gloo AI Gateway. NVIDIA NIM microservices offer a self-operated, hardware-optimized, portable LLM inference solution. Gloo AI Gateway brings routing controls, security, guardrails, and resilience to deployed NIM services.
Quickly Adopting New Models
Which model is the right one for your use case? The reality is that models are improving quickly, and you will likely want to try many of them. We know of organizations that are even building their own GPU farms and bringing model inference into their data centers. That's why NVIDIA built NIM microservices. With NIM, you can run curated AI models (e.g., LLMs, vision, speech, etc.) that have been optimized for NVIDIA hardware on your own infrastructure, such as Kubernetes. If running on Kubernetes, you can use the GPU Operator and NIM Operator to deploy these models, as sketched below.
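For illustration, here is a minimal sketch of what deploying a model through the NIM Operator might look like, using its NIMService custom resource. The image repository, tag, secret names, and storage settings below are assumptions for illustration only; check the NIM Operator documentation for the exact schema and the right values for your model.

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  # NGC image for the model; repository and tag are illustrative
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: "1.3.3"
    pullSecrets:
      - ngc-secret
  # Secret holding the NGC API key (assumed name)
  authSecret: ngc-api-secret
  # Cache model weights on a PVC so restarts don't re-download them
  storage:
    pvc:
      create: true
      size: 50Gi
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  # Expose the OpenAI-compatible endpoint inside the cluster
  expose:
    service:
      type: ClusterIP
      port: 8000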
Gloo AI Gateway simplifies switching between managed AI API calls (e.g., OpenAI gpt-4o) and NIM microservices. For example, a common scenario is that you have already started with OpenAI models and then decide to self-host a Llama-3.1-8B model with NVIDIA NIM. Switching requests over (either by explicit routing or through a canary approach) can be done with Gloo AI Gateway's traffic-shifting capabilities. Let's take a look at a sample configuration:
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: openai
  namespace: gloo-system
spec:
  parentRefs:
    - name: ai-gateway
      namespace: gloo-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /openai
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplaceFullPath
              replaceFullPath: /v1/chat/completions
      backendRefs:
        - group: gloo.solo.io
          kind: Upstream
          name: openai-nim
          namespace: gloo-system
          weight: 50
        - group: gloo.solo.io
          kind: Upstream
          name: openai
          namespace: gloo-system
          weight: 50
Gloo AI Gateway uses the open-standard Kubernetes Gateway API to specify routing to backend LLMs. In this example, a route is exposed on the /openai HTTP path and is forwarded to a backend LLM. Routing can be specified by matching on content (i.e., headers, body, etc.) or by percentage, as shown above where we split the traffic 50/50 between the OpenAI and NIM backends.
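For reference, the openai-nim backend referenced by the route would be defined as a Gloo Upstream. Here is a minimal sketch, assuming the NIM service is reachable at the in-cluster hostname used in the failover example later in this post; since NIM exposes an OpenAI-compatible API, the openai block is used with a custom host:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: openai-nim
  namespace: gloo-system
spec:
  ai:
    openai:
      # Point the OpenAI-compatible client at the in-cluster NIM service
      customHost:
        host: meta-llama3-8b-instruct.default.svc.cluster.local
        port: 8000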
See this quick demo of percentage-based traffic splitting in action, demonstrating a use case where an application built on top of OpenAI is switching to Llama 3.1 8B with NIM:

Guardrails / Content Moderation
The organizations we work with are very risk averse and must adhere to strict data protection and regulatory compliance requirements. These organizations may leverage hosted content moderation services such as OpenAI's moderation service, build guardrails directly into their applications with libraries such as Presidio, or use both. One thing they are very concerned about: what happens if there is a jailbreak not covered by one of those mechanisms? That is, they need a "kill switch" to apply mitigations quickly. Organizations that use an AI gateway can do this without consulting developers or LLM providers.

If an organization is shifting traffic between models (i.e., hosted LLMs vs. NIM), they will want consistency for guardrails and content moderation. Moving off one model in favor of another, or moving from a hosted solution to a self-hosted one, should not change the important guardrails that get applied.
NVIDIA NIM offers a content safety LLM and a topic control LLM that you can use to implement organization-specific content moderation and data protection guardrails. You can host the content safety (or topic control) LLM and tie it into Gloo AI Gateway to get powerful guardrails for requests going to an NVIDIA NIM LLM (or any other provider).

This guardrail can be configured in Gloo AI Gateway with a RouteOption rule attached to the HTTPRoute. See the following example:
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-prompt-guard
  namespace: gloo-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: openai
  options:
    ai:
      promptGuard:
        request:
          webhook:
            forwardHeaders:
              - key: x
                matchType: PREFIX
            host: "nemo-content-moderation-service.default.svc.cluster.local"
            port: 80
In the above example, we configure Gloo AI Gateway with a RouteOption object, an extension to the Kubernetes Gateway API that enables more powerful routing features within the gateway. We configure it to call out to a NIM-backed content-moderation guardrail service, which we can apply to any LLM called by any client. This ensures consistency for compliance purposes and offers a way to quickly enforce or change guardrail policies if a vulnerability is discovered.
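Webhook-based moderation can also be complemented with Gloo AI Gateway's built-in prompt guard rules, which is useful as a quick "kill switch" while a longer-term mitigation rolls out. The following is a rough sketch only: the regex pattern and response message are placeholders, and the exact promptGuard fields should be verified against the Gloo AI Gateway documentation.

apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-kill-switch
  namespace: gloo-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: openai
  options:
    ai:
      promptGuard:
        request:
          # Reject prompts matching a known jailbreak pattern (placeholder regex)
          regex:
            action: REJECT
            matches:
              - pattern: "ignore previous instructions"
                name: "jailbreak-attempt"
          customResponse:
            message: "Request rejected by content policy"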
Quota Enforcement, Model Failover
Different teams may be using different LLM models across the organization, but without central governance or an AI gateway it is difficult to control costs and consistently enforce failover mechanisms or policies. Gloo AI Gateway does two things to help with this. First, it can enforce client-, team-, or organization-level rate limiting for LLMs using org-internal constructs (e.g., internally issued API keys, or authentication mechanisms such as OAuth, LDAP, or other SSO). This rate limiting can be applied to token usage to keep certain clients from overconsuming tokens on an LLM; a sketch follows below.
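As a sketch of what the rate-limiting side could look like, the example below uses Gloo's RateLimitConfig API keyed on a hypothetical x-team-id header. The header name, limits, and units are illustrative, and how token counts (as opposed to raw request counts) feed into the limit is handled by the AI gateway's rate-limiting integration, so consult the Gloo documentation for the exact wiring.

apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-team-llm-limit
  namespace: gloo-system
spec:
  raw:
    descriptors:
      # Budget per unique team identifier (values are illustrative)
      - key: team-id
        rateLimit:
          requestsPerUnit: 100000
          unit: HOUR
    rateLimits:
      - actions:
          # Extract the team identifier from a hypothetical header
          - requestHeaders:
              headerName: x-team-id
              descriptorKey: team-id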
Second, Gloo AI Gateway can automatically fail over to other models when the quota for a particular LLM has been exhausted. For example, team X can consume a specified number of tokens for an expensive model (e.g., OpenAI o3) and, when that is exhausted, fall back to o1, 4o, or Llama 3.1 8B running on self-hosted infrastructure, as in the following configuration:
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  labels:
    app: gloo
  name: model-failover
  namespace: gloo-system
spec:
  ai:
    multi:
      priorities:
        - pool:
            - openai:
                model: "gpt-4o"
                customHost:
                  host: model-failover.gloo-system.svc.cluster.local
                  port: 80
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
        - pool:
            - openai:
                model: "meta/llama-3.1-405b"
                customHost:
                  host: meta-llama3-405b.default.svc.cluster.local
                  port: 80
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
        - pool:
            - openai:
                customHost:
                  host: meta-llama3-8b-instruct.default.svc.cluster.local
                  port: 8000
In this example configuration for Gloo AI Gateway, we specify what the failover list could look like. We try the gpt-4o model from OpenAI first, fall back to Llama 3.1 405B, and finally fall back to Llama 3.1 8B, the latter two both deployed with NIM.
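To put this failover Upstream on the request path, it can be referenced from the HTTPRoute shown earlier in place of the individual backends. A minimal sketch of the updated backendRefs:

# Inside the rule of the HTTPRoute from the traffic-splitting example
backendRefs:
  - group: gloo.solo.io
    kind: Upstream
    name: model-failover
    namespace: gloo-system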

See the demo below for how we can do this with Gloo AI Gateway and NVIDIA NIM:
Observability, Tracing, Debugging
Who is consuming tokens from the various LLMs? How performant are these calls? How can we debug when something behaves abnormally? Routing traffic through a Gloo AI Gateway layer allows operators to consistently see fine-grained usage telemetry without modifying applications or forcing specific libraries. For example, we can graph the typical golden-signal metrics such as error rate, saturation, and requests per second through the gateway infrastructure, as illustrated below:

But we can also graph token usage by client ID, by team, or by organization. We can track token usage (i.e., prompt vs. completion) over time, tokens that get rate limited, or tokens that are returned as part of a semantic cache hit. See the following dashboards for an example:

Lastly, for calls to LLMs, we can also debug their flow through the gateway using distributed tracing. Gloo AI Gateway supports Zipkin/Jaeger-style tracing with metadata about token usage, models called, and the latency associated with each step in the processing:

To see this observability tooling in action, please take a look at the following demo:
Wrapping up
As enterprises accelerate their adoption of LLMs, the challenges of model selection, governance, security, cost control, and observability become critical. In this post, we explored how NVIDIA NIM microservices and Gloo AI Gateway can help organizations address these challenges, enabling quick experimentation and model adoption, enforcing guardrails, managing costs through quota enforcement and failover, and gaining deep visibility into LLM usage.
The combination of NIM for optimized, self-hosted inference and Gloo AI Gateway for traffic control, security, and observability provides a powerful foundation for scaling AI workloads in a way that is cost-efficient, compliant, and flexible. Whether you’re transitioning from OpenAI models to self-hosted Llama 3.1, enforcing organization-wide content moderation, or ensuring reliable failover mechanisms, these technologies make it possible to move fast without breaking things.
Want to see these concepts in action? Check out the demo links throughout this post and try out Gloo AI Gateway and NVIDIA NIM for yourself. Have thoughts or challenges around deploying LLMs at scale? We’d love to hear from you—reach out and let’s discuss!