Today, we'll discuss Retrieval Augmented Generation (RAG), a technique that combines the power of large language models (LLMs) with real-time data retrieval. This allows your AI applications to access up-to-date information, improving the relevance and accuracy of their responses.

What is RAG?

Retrieval Augmented Generation (RAG) is a method that allows an AI model to retrieve external information from databases, documents, or the web, and then use that data to generate more informed and accurate responses. While traditional LLMs have knowledge based on their training data, RAG ensures that the AI can supplement this information with real-time, external data. This helps make the AI’s responses more dynamic and contextually appropriate.

Why should you consider RAG?

There are several reasons why RAG is valuable:

Up-to-date information: Traditional LLMs are limited to the knowledge available up to a certain point in time. With RAG, the AI can access the latest information, ensuring it stays current.
Improved accuracy: By incorporating external data into its response generation, RAG helps the AI model provide more precise and contextually relevant answers.
Flexibility: RAG allows integration with various data sources, whether they be documents, databases, or APIs, making the AI’s knowledge base adaptable to different needs.

How does RAG work?

Here’s a brief overview of the process:

Retrieve: When a user query is received, the system searches external data sources for the most relevant information. Typically it uses semantic similarity to pick the most relevant results.
Augment: The retrieved information is then combined with the original prompt to provide additional context.
Generate: The LLM processes this enriched prompt to generate a more accurate and relevant response.
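
The three steps above can be illustrated with a minimal, self-contained Python sketch. It is not how Gloo AI Gateway implements RAG: the document list, the bag-of-words `embed` function (a stand-in for a real embedding model), and the prompt template are all assumptions made for illustration. The generate step is left out because it would simply send the augmented prompt to the LLM.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory data source; a real setup would query a vector database.
DOCUMENTS = [
    "France produces between 1,000 and 1,600 varieties of cheese.",
    "Kubernetes schedules containers across a cluster of nodes.",
    "Brie is a soft cow's-milk cheese named after a French region.",
]


def embed(text):
    """Toy embedding: word counts, standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=1):
    """Retrieve: rank documents by similarity to the query and keep the top k."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def augment(query, context):
    """Augment: combine the retrieved context with the original question."""
    joined = "\n".join(context)
    return f"Answer using this context.\n\nContext:\n{joined}\n\nQuestion: {query}"


question = "How many varieties of cheeses are in France?"
context = retrieve(question, DOCUMENTS)
prompt = augment(question, context)
print(prompt)  # This augmented prompt is what would be sent to the LLM.
```

A production system would replace `embed` with a learned embedding model and the list scan with a vector database lookup, but the ranking and prompt assembly follow the same pattern.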

Implementing RAG with Solo.io’s Gloo AI Gateway

Now that you understand the benefits of RAG, let’s walk through how to set it up using Solo.io’s Gloo AI Gateway. Below is a high-level overview of the steps required to integrate RAG into your system.

Prerequisites

Before getting started with RAG, make sure you have the following:

Gloo Gateway Enterprise license key with an AI Gateway add-on: Contact a Solo.io account representative to obtain a Gloo Gateway Enterprise license key. Make sure you include the AI Gateway add-on in your license.
Gloo Gateway Enterprise installation: If you haven't already, you can follow the Solo.io Gloo Gateway docs.
Data sources: This blog gives an example data source. If you want to use your own, ensure that the data sources (such as APIs, databases, or document repositories) you plan to use are available and accessible.

Set up the Gloo AI Gateway

If you haven’t already, start by setting up your Gloo AI Gateway and authenticating the gateway with your AI provider. To get started, follow the Gloo AI Gateway docs.

AI response before implementing RAG

Our AI gateway is ready to route requests to the LLM. Before we implement RAG, let's check what the non-augmented AI model returns when we ask it questions.

First, get the external address for your AI gateway.

export INGRESS_GW_ADDRESS=$(kubectl get svc -n gloo-system gloo-proxy-ai-gateway -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS

For this demo, let's try asking our AI about French cheeses.

curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'

You'll likely get a response similar to this one:

{
  "id": "chatcmpl-AEJFJIavD5NkGwyduU82sHbpj2fS7",
  "object": "chat.completion",
  "created": 1727973937,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "France is famous for its vast variety of cheeses, and it's often said that there are over 1,000 different types. This number can vary depending on how cheeses are classified, considering factors like regional variations, aging processes, and even seasonal differences. Charles de Gaulle famously remarked about the difficulty of governing a country with \"246 varieties of cheese,\" but the actual number is considerably higher when all local and artisanal varieties are counted.",
        "refusal": null
      },
...

Not too bad! But the response is fairly verbose and doesn't give us the concise, direct answer that we might be looking for.
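
If you call the gateway from application code rather than curl, a small helper can build the same request body and pull the assistant's text out of the response. The following is a sketch in Python that assumes the OpenAI-style chat completion shape shown above. The function names and the placeholder sample response are hypothetical, and the sketch only handles JSON, so sending the request over HTTP is left to whichever client you prefer.

```python
import json


def build_chat_request(question, model="gpt-4o"):
    """Build the JSON body used in the curl command above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return json.dumps(body)


def extract_answer(response_body):
    """Return the assistant's text from an OpenAI-style chat completion body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Placeholder response with the same shape as the one shown above;
# the content string is made up for this sketch.
sample_response = json.dumps({
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Placeholder answer text.",
                "refusal": None,
            },
        }
    ],
})

print(build_chat_request("How many varieties of cheeses are in France?"))
print(extract_answer(sample_response))
```

Because the gateway returns the provider's response format, the same `extract_answer` helper should work both before and after RAG is enabled on the route.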

AI response after implementing RAG

Now, let's try providing our LLM with some new, more precise data to generate responses from by using RAG.

Start by making your data source available in your cluster. For our French cheese demo, this Kubernetes deployment creates a vector database that includes data and embeddings from a website that provides information about French cheeses. The service then makes it accessible to other services.

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-db
  labels:
    app: vector-db
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vector-db
  template:
    metadata:
      labels:
        app: vector-db
    spec:
      containers:
      - name: db
        image: gcr.io/field-engineering-eu/vector-db
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_DB
          value: gloo
        - name: POSTGRES_USER
          value: gloo
        - name: POSTGRES_PASSWORD
          value: gloo
---
apiVersion: v1
kind: Service
metadata:
  name: vector-db
spec:
  selector:
    app: vector-db
  ports:
    - protocol: TCP
      port: 5432
      targetPort: 5432
EOF

Next, we can make sure the LLM accesses this data by creating a RouteOption resource. This resource configures the HTTPRoute you made earlier to use the vector database service for RAG.

Essentially, when Gloo AI Gateway passes a request along to the LLM for this route, it first looks up relevant entries in the configured datastore and attaches them to the prompt as additional context, so the LLM can draw on that data when it generates a response.

kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    ai:
      rag:
        datastore:
          postgres:
            connectionString: postgresql+psycopg://gloo:gloo@vector-db.default.svc.cluster.local:5432/gloo
            collectionName: default
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
    timeout: "0"
EOF
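
The connectionString above is a URL, and its parts line up with the vector-db Deployment and Service created earlier. The sketch below uses only Python's standard library and a hypothetical `describe_connection` helper to split it apart for inspection. The `postgresql+psycopg` scheme looks like a SQLAlchemy-style dialect and driver pair, but that reading is an assumption rather than something this post states.

```python
from urllib.parse import urlsplit

# Copied from the RouteOption above.
CONNECTION_STRING = (
    "postgresql+psycopg://gloo:gloo@vector-db.default.svc.cluster.local:5432/gloo"
)


def describe_connection(conn):
    """Split a database URL into its named parts (hypothetical helper)."""
    parts = urlsplit(conn)
    return {
        "scheme": parts.scheme,
        "user": parts.username,          # matches POSTGRES_USER in the Deployment
        "password": parts.password,      # matches POSTGRES_PASSWORD
        "host": parts.hostname,          # <service>.<namespace>.svc.cluster.local
        "port": parts.port,              # matches the Service port
        "database": parts.path.lstrip("/"),  # matches POSTGRES_DB
    }


for key, value in describe_connection(CONNECTION_STRING).items():
    print(f"{key}: {value}")
```

Note that the host assumes the vector-db Service lives in the default namespace, which is where the earlier manifest lands when no namespace is specified. If you deploy the database elsewhere, the host portion of the connection string needs to change accordingly.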

To see the difference RAG makes, let's repeat our earlier request and ask it about French cheeses again.

curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'

This time, the AI gateway uses the RAG options that you set up to automatically attach additional context to the query. The response is improved to be much more concise:

{
  "id": "chatcmpl-AGsLfbPZY6Ld2u9PtX473jgdj4KA4",
  "object": "chat.completion",
  "created": 1728585527,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "France has between 1,000-1,600 varieties of cheese.",
        "refusal": null
      },
...

Conclusion

In summary, Retrieval Augmented Generation (RAG) is a technique that enables AI models to access and use external, up-to-date data, improving the relevance and accuracy of their responses. Solo.io’s Gloo AI Gateway lets you implement RAG at the gateway layer, so you can integrate external data sources through route configuration.

By leveraging RAG, you can keep your AI systems accurate and relevant as the underlying data changes. We encourage you to explore Solo.io’s documentation for more detailed instructions and additional resources, or check out our free hands-on lab.

Happy coding, and we hope this helps you get started with RAG!

