No items found.
No items found.

Optimizing GenAI in Production: High-Value Use Cases for AI Gateways

October 23, 2024
Alex Ly

As organizations increasingly adopt Generative AI (GenAI) into their production environments, they face new challenges beyond initial implementation. Scalability constraints, security vulnerabilities, and performance bottlenecks can block the rollout of AI applications if improperly managed. Practical issues such as security concerns around user data, fixed capacity limitations, cost management, and latency optimization require innovative solutions.

At Solo.io, we’re at the forefront of the GenAI journey, supporting our customers as they test, research, and implement AI across their applications and services. In this blog, we dive into some of the unique use cases and strategies they are implementing such as semantic caching, prompt guard, prompt enrichment, and dynamic load balancing to empower their business to build resilient, secure, and scalable AI-powered applications. By exploring these concrete examples, we aim to provide actionable insights for organizations striving to harness the potential of GenAI while overcoming operational hurdles fully.

The “Hi” Use Case:
For our customers, how their end users interact with their products and services is a primary revenue driver. GenAI has transformed accessibility and engagement channels, offering companies a new pathway to faster revenue, more loyal consumers, and improved customer experiences. At Solo.io, we’ve observed this same shift across our customer base and have been supporting a global telecommunications provider that has developed a GenAI-powered chatbot designed to help enhance their customer’s experience.

From their experience, we can see that in many AI-driven customer interactions, users often begin with a default greeting like “Hi.” If using a model such as ChatGPT 4o, the typical response to this is “Hello! How can I assist you today?” This simple exchange, while routine, can quickly add up in token costs for businesses handling high volumes of interactions. Each request of “Hi” consumes 1 token, and the model’s default response uses 9 tokens, resulting in a cost of 10 tokens for this simple interaction. (Source: OpenAI Tokenizer)

While these small numbers might seem insignificant, if an organization was to scale to 1 billion requests per year, the total cost amounts to $92,500! A respectful greeting and response is important to the customer experience, but this example represents a significant opportunity for cost reduction.

Token Type
Tokens (Billion)
Cost Per Million
Total Cost
Input
1
$2.50
$2,500.00
Output
9
$10.00
$90,000.00
Total
$92,500.00
GPT 4o – Cost Breakdown

This is where semantic caching becomes a game-changer. Since the vast majority of users send this same initial input and receive the same output, there’s no need to generate a new response every time. By caching this common interaction, businesses can ensure that only the input token is counted for each request, while the output is served from the cache. With a 100% cache hit rate and no expiry, this reduces the cost to just $2,500 per year for the input tokens only—saving $90,000 on repeated, non-unique interactions. This approach not only optimizes costs but also reallocates resources to exchanges that drive direct value-add to the end user.

Application Performance Advantage:
The status quo today is that organizations implementing GenAI technology often expect some level of latency in their interactions. Optimizing performance through semantic caching offers a powerful way to enhance user experience and represents a competitive advantage for businesses looking to drive adoption of their AI-powered applications.

Semantic caching allows frequently repeated queries, like the common “Hi” request, to be served in sub-millisecond times instead of seconds. Additionally, by configuring a similarity score, slight variations like “Hey” or “Hello” can also be served from the cache.

Using a solution like Gloo AI Gateway, adding responses to the cache is straightforward. Common requests like “How do I reach Customer Support?” or “Can you link me to the FAQ?” can be cached with minimal configuration, and since caching is done semantically, the Gloo AI Gateway returns a cached response for any requests similar to the cached one. This provides both speed and consistency while reducing resource consumption. For Enterprises, this translates into not only enhanced user satisfaction but also significant cost savings and increased scalability benefits.

Prompt Guard Use Cases

Prompt Guard Control:
While deploying GenAI technology may seem straightforward, organizations still face challenges. For example, when our customer started building a chatbot implementation requiring a GenAI application to be in production, we identified that there potentially was a portion of queries that could be inappropriate or malicious. Allowing such queries to reach the language model (LLM) wastes resources, but also introduces unnecessary risks.

By blocking inappropriate prompts at the gateway, the application avoids both the latency cost and token cost associated with sending such requests to the LLM. For example, a simple regex pattern can be applied to flag inappropriate language or prevent certain keywords, ensuring that these requests never make it past the gateway. This approach not only reduces operational costs but also strengthens their application’s security and improves application efficiency delivering an overall better experience.

Advanced Prompt Guard:
As workloads become more complex and advanced, we also anticipate that are also potential use cases where filtering out inappropriate or sensitive data goes beyond simple string matching or regex rules. For instance, detecting and scrubbing Personally Identifiable Information (PII) in user input requires more sophisticated handling, and this is where an Advanced Prompt Guard comes into play. Developers can instead of relying on basic techniques, have inputs instead first sent to a locally served, fine-tuned Small Language Model (SLM) that specializes in detecting, scrubbing, and sanitizing sensitive data before it is passed on to the backend LLM.

Unlike regex or string matching, which are limited by predefined patterns and struggle with context, an SLM can analyze inputs dynamically, identifying nuanced or obfuscated PII, such as names, addresses, or payment information.

AI Gateway Use Cases Explored Blog

The SLM then processes the input, scrubbing sensitive information in real-time, and returns a cleaned version of the prompt. Only this sanitized version is forwarded to the LLM for further processing. This layered approach not only safeguards user data but also ensures that the AI system operates within compliance and security standards, without incurring unnecessary risks or costs.

Prompt Enrichment Use Cases

Adding Layers of Control for Business and Security Needs
As GenAI implementations become more complex, we recommend customers to add layers of nuanced control to enhance the security of their GenAI workloads. This is where prompt engineering becomes crucial, as it allows developers to fine-tune how AI models respond to user inputs, optimizing both performance and behavior. While developers focus on crafting the best possible prompts for their specific use cases, businesses often need additional layers of control to meet security, compliance, and organizational standards.

AI Gateway Use Cases Explored Blog

Prompt Enrichment at the gateway level provides this additional control, allowing extended teams such as  Security and Platform to prepend or append business-specific prompts without altering the underlying application code. For example, a Security team might enforce system-level prompts such as: “If the request involves legal, medical, or personal identification information (PII), respond with ‘I’m sorry, but I cannot assist with that topic. Please consult a professional for more information.’”

This serves as a critical first layer in a defense-in-depth strategy, complemented by Prompt Guard for scenarios where the model might hallucinate and violate the prompt. Business teams can also use Prompt Enrichment to customize responses by adding multilingual translations or applying filters for sensitive content. With Prompt Enrichment, organizations maintain flexibility, enforce essential safeguards, and ensure consistent messaging without limiting developers’ ability to innovate at the application level.

Load Balancing Use Cases

As our customer’s chatbot AI implementation starts to mature and requires expansion into additional workloads, finding an efficient and secure way to manage and optimize their GenAI capacity across multiple deployments will become a critical component of the organization’s successful integration of GenAI.

For example, platforms like Azure OpenAI, capacity is provisioned and metered through PTUs (Processing Throughput Units), with each deployment constrained by a fixed quota. These deployments are often tightly coupled with applications, which presents challenges when capacity is depleted, or performance needs to be optimized without disrupting service. Admins can create multiple endpoints to manage capacity for various applications, regions, or business units, but dynamically routing traffic across these endpoints requires sophisticated load balancing strategies. From managing capacity constraints to regional routing and ephemeral backend support, businesses need flexible, automated solutions to ensure high availability, low latency, and seamless scaling in their GenAI infrastructure.

Capacity Transparency
We recommend to customers as their teams deploy GenAI across their applications and services, they should take note of the SLAs around provisioning capacity to understand scaling restrictions. For example, the Capacity Transparency from Microsoft provides transparency around capacity but does not guarantee that availability will be there when you need it. Capacity is allocated at the time of deployment and remains reserved as long as the deployment exists. However, scaling down or deleting a deployment releases that capacity back to the region, with no assurance that it will be available again for future scaling or redeployment. Given this uncertainty, implementing an abstraction layer that can route traffic to any available backend can mitigate the risks associated with scaling limitations and ensure greater flexibility.

Handling Fixed Capacity Consumption Without Application-Level Changes
If teams have any applications hardcoded to specific backend LLM endpoints (e.g., acme-gpt4o-01.openai.azure.com), reaching capacity limits can cause disruptions if the application must be restarted to switch to a different backend. An effective solution is for teams to implement a dynamic load-balancing layer at the gateway level. This enables traffic to be automatically routed to alternate endpoints before capacity thresholds are reached, ensuring uninterrupted service without requiring application changes or restarts. This approach ensures high availability and seamless scalability, allowing them to manage capacity effectively while keeping applications stable and operational.

Centralized Traffic Management Across LLM Endpoints
From a security standpoint, we also recommend that customers route all LLM traffic through a single egress point for crucial security, compliance, and operational oversight. This ensures that all interactions are monitored, logged, and adhere to their organization’s policies. However, enforcing this while managing multiple LLM deployments across regions and applications can be challenging. To balance security and flexibility, centralized traffic management via Gloo AI Gateway can be implemented.

AI Gateway Use Cases Explored Blog

This setup directs all LLM traffic through a single, controlled point, while still allowing dynamic routing to backend deployments for capacity or performance optimization. This solution enables strict security compliance without sacrificing the flexibility to scale and optimize backend capacity.

Reducing User Latency with Regionally Aware Routing
If our customers havedistributed applications, minimizing user latency would be a key factor in improving user experience. One way to achieve this is by ensuring requests are routed to the nearest backend LLM deployment based on the user’s geographical location. By implementing regionally aware routing, the application can automatically direct traffic to the closest backend LLM endpoint (e.g., acme-gpt4o-02.openai.azure.com for Europe, acme-gpt4o-03.openai.azure.com for Asia).

AI Gateway Use Cases Explored Blog

This could potentially be managed through host-based routing at the gateway level, allowing the application to remain geographically flexible and optimize performance without manual intervention.

Supporting Ephemeral Backend Capacity with Dynamic Routing
Managing backend capacity often requires the flexibility to deploy and remove LLM instances based on real-time demand. A challenge arises when ephemeral capacity needs to be spun up for short-term usage without affecting traffic flow to existing deployments. With dynamic routing, traffic can be automatically directed to newly deployed LLMs as they come online, without disrupting ongoing operations. This ensures that capacity constraints can be addressed on demand while scaling down deployments frees up resources without impacting the application’s performance or requiring manual reconfiguration.

Automating LLM Capacity Deployment via Pipeline Triggers
Automating the deployment of LLM capacity is essential for responding to real-time demand fluctuations. For example, a pipeline can be configured to trigger the creation of new LLM deployments with fixed capacity when predefined conditions are met, such as approaching quota limits or increased traffic. Once a new deployment is created, routing rules can be automatically updated to direct traffic to the new LLM capacity. This approach streamlines the process, reducing the risk of capacity shortages and ensuring continuous availability without manual intervention.

Although GenAI is still in its early stages, some of our innovative customers are already integrating it into their services and applications to enhance customer experiences. In this process, they have partnered with us at Solo.io to troubleshoot the challenges associated with GenAI-powered workloads, including resource management, security, and performance optimization. We have strategically collaborated with them to address these issues while building new features into Gloo AI Gateway, ensuring the security and efficient management of their AI workloads.

Next Steps

You can learn more about Gloo AI Gateway from the Solo product page or the technical docs.

Alternatively, get in touch with our team to see how Gloo AI Gateway can help drive AI innovation in your organization or learn more about AI and LLM API management in our technical lab series.

Cloud connectivity done right