Acknowledgements
Thanks to Monika Katariya, Alex Steiner, and Adam Tetelman from NVIDIA and Lin Sun from Solo.io for their reviews of this blog.
Although LLM usage presents many opportunities for enterprise organizations, it also raises concerns around scaling, such as cost, data privacy, performance, and resilience. For example, we have customers building internal chat systems for their employees that use company-internal data to solve problems more quickly, offer solutions, and improve user experience. Common challenges these customers run into when implementing these scenarios include:
- Quickly onboarding new LLM models and safely switching between them
- Enforcing guardrails and content moderation for data protection, compliance and security
- Tracking LLM usage, preventing cost overruns, establishing usage quotas and implementing model failover mechanisms
- Gaining deep observability into which LLM calls are happening, who is consuming LLM tokens, and how to debug when things go wrong or slow down
In this blog we look at two technologies designed to help solve these challenges: NVIDIA NIM microservices for deploying LLMs on Kubernetes, and Gloo AI Gateway. NVIDIA NIM microservices offer a self-operated, hardware-optimized, portable LLM inference solution. Gloo AI Gateway brings routing controls, security, guardrails, and resilience to deployed NIM services.
Quickly Adopting New Models
Which model is the right one for your use case? The reality is that models are improving quickly, and you will likely want to try many of them. We know of organizations that are even building their own GPU farms and bringing model inference into their data centers. That's why NVIDIA built NIM microservices. With NIM, you can run curated AI models (e.g., LLMs, vision, speech, etc.) that have been optimized for NVIDIA hardware on your own infrastructure, such as Kubernetes. If running on Kubernetes, you can use the GPU Operator and NIM Operator to deploy these models, as sketched below.
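For illustration, here is a minimal sketch of what deploying a model through the NIM Operator might look like, using its NIMService custom resource. The image repository, tag, secret names, and storage settings below are assumptions for illustration only; check the NIM Operator documentation for the exact schema and the right values for your model.

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  # NGC image for the model; repository and tag are illustrative
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: "1.3.3"
    pullSecrets:
      - ngc-secret
  # Secret holding the NGC API key (assumed name)
  authSecret: ngc-api-secret
  # Cache model weights on a PVC so restarts don't re-download them
  storage:
    pvc:
      create: true
      size: 50Gi
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  # Expose the OpenAI-compatible endpoint inside the cluster
  expose:
    service:
      type: ClusterIP
      port: 8000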
Gloo AI Gateway simplifies switching between managed AI API calls (e.g., OpenAI gpt-4o) and NIM microservices. For example, a common scenario is that you have already started with OpenAI models and then decide to self-host a Llama-3.1-8B model with NVIDIA NIM. Switching requests over (either by explicit routing or through a canary approach) can be done with Gloo AI Gateway's traffic-shifting capabilities. Let's take a look at a sample configuration:
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: openai
  namespace: gloo-system
spec:
  parentRefs:
    - name: ai-gateway
      namespace: gloo-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /openai
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplaceFullPath
              replaceFullPath: /v1/chat/completions
      backendRefs:
        - group: gloo.solo.io
          kind: Upstream
          name: openai-nim
          namespace: gloo-system
          weight: 50
        - group: gloo.solo.io
          kind: Upstream
          name: openai
          namespace: gloo-system
          weight: 50
Gloo AI Gateway uses the open-standard Kubernetes Gateway API to specify routing to backend LLMs. In this example, a route is exposed on the /openai HTTP path and is forwarded to a backend LLM. Routing can be specified by matching on content (i.e., headers, body, etc.) or by percentage, as shown above where we split the traffic 50/50 between the OpenAI and NIM backends.
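For reference, the openai-nim backend referenced by the route would be defined as a Gloo Upstream. Here is a minimal sketch, assuming the NIM service is reachable at the in-cluster hostname used in the failover example later in this post; since NIM exposes an OpenAI-compatible API, the openai block is used with a custom host:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: openai-nim
  namespace: gloo-system
spec:
  ai:
    openai:
      # Point the OpenAI-compatible client at the in-cluster NIM service
      customHost:
        host: meta-llama3-8b-instruct.default.svc.cluster.local
        port: 8000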
See this quick demo of percentage-based traffic splitting in action, demonstrating a use case where an application built on top of OpenAI is switching to Llama 3.1 8B with NIM:

Guardrails / Content Moderation
The organizations we work with are very risk averse and must adhere to strict data protection and regulatory compliance requirements. These organizations may leverage hosted content moderation services such as OpenAI's moderation service, build guardrails directly into their applications with libraries such as Presidio, or use both. One thing they are very concerned about: what happens if there is a jailbreak not covered by one of those mechanisms? That is, they need a "kill switch" to apply mitigations quickly. Organizations that use an AI gateway can do this without consulting developers or LLM providers.

If an organization is shifting traffic between models (i.e., hosted LLMs vs. NIM), they will want consistency for guardrails and content moderation. Moving off one model in favor of another, or moving from a hosted solution to a self-hosted one, should not change the important guardrails that get applied.
NVIDIA NIM offers a content safety LLM and a topic control LLM that you can use to implement organization-specific content moderation and data protection guardrails. You can host the content safety (or topic control) LLM and tie it into Gloo AI Gateway to get powerful guardrails for requests going to an NVIDIA NIM LLM (or any other provider).

This guardrail can be configured in Gloo AI Gateway with a RouteOption rule attached to the HTTPRoute. See the following example:
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-prompt-guard
  namespace: gloo-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: openai
  options:
    ai:
      promptGuard:
        request:
          webhook:
            forwardHeaders:
              - key: x
                matchType: PREFIX
            host: "nemo-content-moderation-service.default.svc.cluster.local"
            port: 80
In the above example, we configure Gloo AI Gateway with a RouteOption object, an extension to the Kubernetes Gateway API that enables more powerful routing features within the gateway. We configure it to call out to a NIM-backed content-moderation guardrail service, which we can apply to any LLM called by any client. This ensures consistency for compliance purposes and offers a way to quickly enforce or change guardrail policies if a vulnerability is discovered.
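Webhook-based moderation can also be complemented with Gloo AI Gateway's built-in prompt guard rules, which is useful as a quick "kill switch" while a longer-term mitigation rolls out. The following is a rough sketch only: the regex pattern and response message are placeholders, and the exact promptGuard fields should be verified against the Gloo AI Gateway documentation.

apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-kill-switch
  namespace: gloo-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: openai
  options:
    ai:
      promptGuard:
        request:
          # Reject prompts matching a known jailbreak pattern (placeholder regex)
          regex:
            action: REJECT
            matches:
              - pattern: "ignore previous instructions"
                name: "jailbreak-attempt"
          customResponse:
            message: "Request rejected by content policy"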
Quota Enforcement, Model Failover
Different teams may be using different LLM models across the organization, but without central governance or an AI gateway it is difficult to control costs and consistently enforce failover mechanisms or policies. Gloo AI Gateway does two things to help with this. First, it can enforce client-, team-, or organization-level rate limiting for LLMs using org-internal constructs (e.g., internally issued API keys, or authentication mechanisms such as OAuth, LDAP, or other SSO). This rate limiting can be applied to token usage to keep certain clients from overconsuming tokens on an LLM; a sketch follows below.
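As a sketch of what the rate-limiting side could look like, the example below uses Gloo's RateLimitConfig API keyed on a hypothetical x-team-id header. The header name, limits, and units are illustrative, and how token counts (as opposed to raw request counts) feed into the limit is handled by the AI gateway's rate-limiting integration, so consult the Gloo documentation for the exact wiring.

apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-team-llm-limit
  namespace: gloo-system
spec:
  raw:
    descriptors:
      # Budget per unique team identifier (values are illustrative)
      - key: team-id
        rateLimit:
          requestsPerUnit: 100000
          unit: HOUR
    rateLimits:
      - actions:
          # Extract the team identifier from a hypothetical header
          - requestHeaders:
              headerName: x-team-id
              descriptorKey: team-id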
Second, Gloo AI Gateway can automatically fail over to other models when the quota for a particular LLM has been exhausted. For example, team X can consume a specified number of tokens for an expensive model (e.g., OpenAI o3) and, when that is exhausted, fall back to o1, 4o, or Llama 3.1 8B running on self-hosted infrastructure, as in the following configuration:
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  labels:
    app: gloo
  name: model-failover
  namespace: gloo-system
spec:
  ai:
    multi:
      priorities:
        - pool:
            - openai:
                model: "gpt-4o"
                customHost:
                  host: model-failover.gloo-system.svc.cluster.local
                  port: 80
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
        - pool:
            - openai:
                model: "meta/llama-3.1-405b"
                customHost:
                  host: meta-llama3-405b.default.svc.cluster.local
                  port: 80
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
        - pool:
            - openai:
                customHost:
                  host: meta-llama3-8b-instruct.default.svc.cluster.local
                  port: 8000
In this example configuration for Gloo AI Gateway, we specify what the failover list could look like. We try the gpt-4o model from OpenAI first, fall back to Llama 3.1 405B, and finally fall back to Llama 3.1 8B, the latter two both deployed with NIM.
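To put this failover Upstream on the request path, it can be referenced from the HTTPRoute shown earlier in place of the individual backends. A minimal sketch of the updated backendRefs:

# Inside the rule of the HTTPRoute from the traffic-splitting example
backendRefs:
  - group: gloo.solo.io
    kind: Upstream
    name: model-failover
    namespace: gloo-system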

See the demo below for how we can do this with Gloo AI Gateway and NVIDIA NIM:
Observability, Tracing, Debugging
Who is consuming tokens from the various LLMs? How performant are these calls? How can we debug when something behaves abnormally? Routing traffic through a Gloo AI Gateway layer allows operators to consistently see fine-grained usage telemetry without modifying applications or forcing specific libraries. For example, we can graph the typical golden-signal metrics such as error rate, saturation, and requests per second through the gateway infrastructure, as illustrated below:

But we can also graph token usage by client ID, by team, or by organization. We can track token usage (i.e., prompt vs. completion) over time, tokens that get rate limited, or tokens that are returned as part of a semantic cache hit. See the following dashboards for an example:

Lastly, for calls to LLMs, we can also debug their flow through the gateway using distributed tracing. Gloo AI Gateway supports Zipkin/Jaeger-style tracing with metadata about token usage, models called, and the latency associated with each step in the processing:

To see this observability tooling in action, please take a look at the following demo:
Wrapping up
As enterprises accelerate their adoption of LLMs, the challenges of model selection, governance, security, cost control, and observability become critical. In this post, we explored how NVIDIA NIM microservices and Gloo AI Gateway can help organizations address these challenges, enabling quick experimentation and model adoption, enforcing guardrails, managing costs through quota enforcement and failover, and gaining deep visibility into LLM usage.
The combination of NIM for optimized, self-hosted inference and Gloo AI Gateway for traffic control, security, and observability provides a powerful foundation for scaling AI workloads in a way that is cost-efficient, compliant, and flexible. Whether you’re transitioning from OpenAI models to self-hosted Llama 3.1, enforcing organization-wide content moderation, or ensuring reliable failover mechanisms, these technologies make it possible to move fast without breaking things.
Want to see these concepts in action? Check out the demo links throughout this post and try out Gloo AI Gateway and NVIDIA NIM for yourself. Have thoughts or challenges around deploying LLMs at scale? We’d love to hear from you—reach out and let’s discuss!