PG Phriday: Papa's Got a Brand New RAG
AI is all anyone talks about these days, isn’t it? Even when it comes to databases like Postgres, devs are finding new and innovative ways to leverage LLMs in everyday use cases. Can you really blame them though? This is an exciting new technology that will transform how we work and potentially society as a whole once it finally matures. We even covered building a crude RAG app a few short weeks ago.
Technology moves fast though—especially when it comes to AI—and the techniques we use will inevitably evolve accordingly. So let’s rebuild our crude RAG app using a more modern approach that handles much of the work for us!
There’s a lot of code here, so check out the Git repository to follow along more easily.
A Grab Bag of Tools
Like our previous attempt, we’re going to need a whole lot of prerequisite libraries, utilities, and at least one model. What’s neat about this is that acquiring all of these is the hard part.
Let’s go over the basics:
- LM Studio - To easily browse and download LLMs.
- Ollama - To run the LLMs and provide a REST API.
- Docker - Quickly set up and tear down the database and related tools.
- pg_vectorize - To do literally everything else.
Using LM Studio is as simple as launch and search. It can run LLMs on its own, but may not properly leverage GPU resources. The big win here is that it’s easy to find recent popular LLMs that will fit in local GPU RAM and see basic documentation on using them.
Ollama definitely can use local GPUs, and it’s a one-line install to get it running as a local service.
curl -fsSL https://ollama.com/install.sh | sudo sh
There’s really only one modification we need to make to Ollama after it’s installed, so that it binds to all IPs rather than just the default of localhost:
sudo systemctl edit ollama
Then paste these contents:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart Ollama for good measure, and it’s ready to roll!
sudo systemctl restart ollama
It should be running on all IPs on the default port of 11434, ready to serve any models we’ve installed.
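To sanity-check the service, we can hit Ollama’s REST API directly. Here’s a minimal Python sketch using the requests library against the /api/tags endpoint, which lists locally installed models (it will be empty until we import one in the next section):
import requests

# Ollama's REST API lists locally installed models at GET /api/tags.
resp = requests.get('http://localhost:11434/api/tags', timeout=5)
resp.raise_for_status()

# Prints nothing until we import a model below.
for model in resp.json().get('models', []):
    print(model['name'])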
Building the Brains
Assuming we downloaded a model using LM Studio, we need to read the documentation on how that particular LLM must be prompted. For the sake of this demonstration, we chose LLaMa3-Instruct as it’s incredibly recent and quite capable even at a mere 8B parameters. You could run this thing on a modestly-specced laptop.
That said, we still need to import the model into Ollama. Since LLaMa3 is already fairly popular, its prompt template and common parameters are well documented.
Since we’re building a Postgres RAG stack, we can also pre-load a system prompt focused on Postgres. Taking all of that into consideration, we’d want to create a file named llama3.modelfile with these contents:
FROM Meta-Llama-3-8B-Instruct.Q8_0.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
SYSTEM """You are a PostgreSQL expert who always answers user questions with concise answers about Postgres."""
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
PARAMETER num_keep 24
Then we can import the model like this:
ollama create llama3 -f llama3.modelfile
Now we can use the model any time we want. Let’s test it for good measure:
ollama run llama3
>>> Please give one short sentence that best describes Postgres.
PostgreSQL is an open-source relational database management system known for its reliability, data integrity, and scalability.
Great!
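The model is also reachable over Ollama’s REST API, which is how pg_vectorize will talk to it later. As a quick illustration, here’s a small Python sketch against the /api/generate endpoint with streaming disabled, so the whole answer comes back as one JSON document:
import requests

# Send a single prompt to the imported llama3 model and print the reply.
resp = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3',
        'prompt': 'Please give one short sentence that best describes Postgres.',
        'stream': False,   # return one JSON document instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()['response'])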
Chunky Style
In our previous foray, we built a content importer that calculated text embeddings based on content chunks during the import process. This is a great way to use a local sentence transformer rather than a service like OpenAI, but we’re going to be using pg_vectorize
for this, so we can remove all of that code.
That means our tables look more like this:
CREATE TABLE blog_articles (
article_id BIGINT PRIMARY KEY NOT NULL GENERATED ALWAYS AS IDENTITY,
author TEXT,
title TEXT,
content TEXT,
publish_date DATE,
last_updated TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE blog_article_chunks (
chunk_id BIGINT PRIMARY KEY NOT NULL GENERATED ALWAYS AS IDENTITY,
article_id BIGINT NOT NULL REFERENCES blog_articles,
chunk TEXT,
last_updated TIMESTAMPTZ NOT NULL DEFAULT now()
);
Rather than a table named blog_article_embeddings
, we now track the chunks separately. At least for now, LLMs still often have problems with context length, and 8k seems to be the most common limitation. Even a single blog article is likely to cross this boundary, so it’s best to still operate with more restricted extracts instead. As the technology matures, this limitation is likely to be lifted, but for now, we need to account for it.
The Short Slurp
The Python code is also dramatically simplified without all of the overhead of generating embeddings during the import process. Our new importer looks like this:
import rag
import config
import psycopg2
from glob import glob
import_path = './corpus'
conn = psycopg2.connect(
    host = 'my-db-host', port = 5432, user = 'postgres',
    password = None, database = 'postgres'
)

cursor = conn.cursor()
chunker = rag.Splitter()

for blog in glob(import_path + '/*.md'):
    article = rag.parse_hugo(blog)
    print("Importing " + article['title'], flush=True)

    cursor.execute(
        """INSERT INTO blog_articles (author, title, content, publish_date)
           VALUES (%s, %s, %s, %s) RETURNING article_id""",
        (article['author'], article['title'], article['content'], article['date'])
    )

    article_id = cursor.fetchone()[0]
    chunks = chunker.chunk_content(article['content'])

    for i in range(len(chunks)):
        cursor.execute(
            """INSERT INTO blog_article_chunks (article_id, chunk)
               VALUES (%s, %s)""",
            (article_id, chunks[i])
        )

conn.commit()
That may not seem like much of a change, and it isn’t. But take a closer look at the chunk_content
method stripped of its class wrappings:
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", ' ', '.', '```'],
    chunk_size = 500,
    chunk_overlap = 20,
    length_function = len,
    is_separator_regex = False
)

def chunk_content(content):
    chunks = splitter.split_text(content)
    return chunks
The transformer is gone, the vector embeddings are gone, and all we’re left with is a reusable splitter definition and a function call. As a result, importing is much faster than before.
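As a quick illustration of what the splitter produces, here’s a short usage sketch with a made-up Markdown string (the sample text is purely illustrative):
# Chunk a sample Markdown string and inspect the resulting pieces.
sample = '## Upserts in Postgres\n\n' + ('Use INSERT ... ON CONFLICT to merge rows. ' * 40)

chunks = chunk_content(sample)
print(len(chunks), 'chunks')

for chunk in chunks:
    print(len(chunk), chunk[:50])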
Starting Services
This is the part where we use Docker to run all the stuff we’re too lazy to install locally. We only really need two things: a Postgres instance, and an embedding server. Since we need to run these two things in tandem, it makes sense to use docker compose
. Tembo likes to boast about their extension stacks, one of which is based on AI use cases. They’ve been kind enough to maintain images for demonstration and development purposes, and this includes an image that provides all the prerequisites for pg_vectorize
and an image which can act as an embedding server.
This simple docker-compose.yml
will launch both of these containers, and notice the extra_hosts
line where we use host.docker.internal
. This will allow the docker containers to reach back into the host system where our instance of Ollama is running. We’ll need this for later!
services:
  postgres:
    container_name: rag-postgres
    image: quay.io/tembo/vectorize-pg:latest
    ports:
      - 5432:5432
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped
  vector-serve:
    container_name: rag-vector-serve
    image: quay.io/tembo/vector-serve:latest
    ports:
      - 3000:3000
    restart: unless-stopped
Then we can create and start the containers, getting everything ready for our RAG app.
docker compose create
docker compose start
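Before moving on, it doesn’t hurt to verify the Postgres container is accepting connections. A minimal psycopg2 check (the compose file above maps port 5432 and uses trust auth, so no password is needed):
import psycopg2

# Connect to the containerized Postgres instance and confirm it responds.
conn = psycopg2.connect(host='localhost', port=5432, user='postgres', dbname='postgres')

with conn.cursor() as cursor:
    cursor.execute('SELECT version();')
    print(cursor.fetchone()[0])

conn.close()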
At this point, we can connect to our database and initialize the vectorize
extension which will be doing all the heavy lifting from now on.
psql -h localhost -U postgres -d postgres
postgres=# CREATE EXTENSION vectorize CASCADE;
NOTICE: installing required extension "pg_cron"
NOTICE: installing required extension "pgmq"
NOTICE: installing required extension "vector"
CREATE EXTENSION
Time: 76.205 ms
Why a Docker image for Postgres? Installing pg_vectorize currently requires several steps, as the extension is still relatively new. There are not yet Debian or RPM packages for easy distribution, so the fastest way to play with it is to use a pre-built image.
What’s Your Vector, Victor?
Despite purging embeddings from our content import process, we still need embeddings! Thankfully the pg_vectorize
extension provides the ability to maintain embeddings on our behalf. As long as we’re using version 0.6.0 or higher, we can specify any OpenAI-compatible service rather than use an OpenAI account.
Given what we know about our Docker environment, we need to set a parameter for pg_vectorize
to recognize our custom server:
ALTER SYSTEM SET vectorize.embedding_service_url TO 'http://rag-vector-serve:3000/v1/embeddings';
SELECT pg_reload_conf();
We should also note that the vector-serve
image comes bootstrapped with the all-MiniLM-L12-v2
sentence transformer. It can download others to cache locally, but if we use the one it comes with, we don’t need to do any extra work.
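We can also poke the embedding server directly from the host to confirm the model is loaded. This sketch assumes the standard OpenAI-style request body and that the bundled model is addressable as all-MiniLM-L12-v2; adjust the identifier if your image expects the full sentence-transformers/ prefix:
import requests

# Request an embedding from the vector-serve container on port 3000.
resp = requests.post(
    'http://localhost:3000/v1/embeddings',
    json={
        'input': ['How do I merge rows in Postgres?'],
        'model': 'all-MiniLM-L12-v2',   # assumed identifier; may need the full prefix
    },
    timeout=30,
)
resp.raise_for_status()

embedding = resp.json()['data'][0]['embedding']
print(len(embedding))   # all-MiniLM-L12-v2 produces 384-dimensional vectors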
Assuming we’ve imported our Hugo Markdown corpus using import.py
, our blog_articles
and blog_article_chunks
tables should have sufficient content to get started. We only need a single SQL command to create the embeddings:
SELECT vectorize.table(
job_name => 'blog_article_chunks_rt',
"table" => 'blog_article_chunks',
primary_key => 'chunk_id',
update_col => 'last_updated',
columns => ARRAY['chunk'],
transformer => 'sentence-transformers/all-MiniLM-L12-v2',
schedule => 'realtime'
);
INFO: creating queue;
table
--------------------------------------------------
Successfully created job: blog_article_chunks_rt
The job name is how pg_vectorize
keeps track of tables that need regular updates to embeddings. By setting the schedule
to realtime
, it attaches a trigger to the blog_article_chunks
table so any update to chunk content will also update the embedding. One of the reasons pg_vectorize
requires the pgmq
extension is so this procedure is asynchronous. Our command launches the job in the background and we get our psql prompt back immediately, while the embeddings get generated.
Once the job completes, we should see a few new tables in the vectorize
schema:
SELECT tablename
FROM pg_tables
WHERE schemaname = 'vectorize'
AND tablename LIKE '_embed%';
tablename
------------------------------------
_embeddings_blog_article_chunks_rt
(1 row)
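Although our end goal is full RAG, the embeddings are already usable on their own. pg_vectorize also provides a vectorize.search() function for plain similarity queries against a job like ours. Roughly, from Python (a sketch; consult the pg_vectorize docs for the exact signature):
import psycopg2

# Ask the 'blog_article_chunks_rt' job for the three chunks most similar
# to a natural-language query; results come back as JSON.
conn = psycopg2.connect(host='localhost', port=5432, user='postgres', dbname='postgres')

with conn.cursor() as cursor:
    cursor.execute(
        """SELECT vectorize.search(
               job_name => 'blog_article_chunks_rt',
               query => 'merging rows and upserts',
               return_columns => ARRAY['chunk_id', 'chunk'],
               num_results => 3
           );"""
    )
    for row in cursor.fetchall():
        print(row[0])

conn.close()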
From this point onward, any new blogs we insert, or any we modify, will automatically get new embeddings without any extra work from us. But we’re building a RAG app here, not just a system to maintain embeddings. So rather than using vectorize.table
, let’s leverage a different tactic!
Enabling RAG
With the release of v0.15, the pg_vectorize
extension gained the ability to send queries to LLMs other than OpenAI. This is the real reason we set up our own Ollama server. As with sentence transformers, we can specify a configuration parameter to control how RAG requests are resolved:
ALTER SYSTEM SET vectorize.ollama_service_url TO 'http://host.docker.internal:11434';
SELECT pg_reload_conf();
If we were using some dedicated GPU server or cloud service, we would specify that URL rather than the docker host. Regardless, once that parameter is set, we simply need to initialize the RAG functionality of pg_vectorize
:
SELECT vectorize.init_rag(
agent_name => 'blog_chat',
table_name => 'blog_article_chunks',
"column" => 'chunk',
unique_record_id => 'chunk_id',
transformer => 'sentence-transformers/all-MiniLM-L12-v2',
schedule => 'realtime'
);
init_rag
-------------------------------------
Successfully created job: blog_chat
Note how similar this is to the vectorize.table
approach. By telling pg_vectorize
we want to build a RAG app, it knows we need embeddings as well, so it will maintain them in the same way as if we’d simply performed that step separately.
Believe it or not, we now have a fully functional RAG app, purely through SQL. Observe:
SELECT vectorize.rag(
agent_name => 'blog_chat',
query => 'How can I handle row merge or upserts with Postgres?',
chat_model => 'ollama/llama3'
) -> 'chat_response';
"Use a single SQL statement with the `INSERT ... ON CONFLICT` syntax. This allows you to perform an insert operation that will either create a new row or update an existing one, depending on whether the row already exists."
Why does the answer not cover the actual Postgres MERGE
syntax? Because this blog corpus doesn’t have an article on that topic! An 8B parameter model is unlikely to have very deep knowledge about Postgres v15 and above, so most of what it does know comes from our blog excerpts.
For more information on how pg_vectorize
handles sentence transformers and models, check out the official documentation.
Easiest RAG App Ever
So how does this affect our final Python RAG app? See for yourself:
import sys
import psycopg2

question = sys.argv[1]

query = """
    SELECT vectorize.rag(
        agent_name => 'blog_chat',
        query => %s,
        chat_model => 'ollama/llama3'
    ) -> 'chat_response';
"""

conn = psycopg2.connect(
    host = 'my-db-host', port = 5432, user = 'postgres',
    password = None, database = 'postgres'
)

print("Sending your question and references to the LLM...", flush=True)

with conn.cursor() as cursor:
    cursor.execute(query, (question,))
    answer = cursor.fetchone()[0]

conn.close()

print("Response:\n\n", flush=True)
print(answer)
Well that was anticlimactic! Our once mighty RAG app which directly wielded the cosmic power of an LLM has been reduced to a mere query wrapper. But does it work?
python question-rag.py "Is Postgres the best database engine?"
Four times since 2017, it has won the DB-Engines "DBMS of the Year" award.
Well said!
Final Thoughts
Postgres extensions make it possible for the database to act as a kind of middleware. While this capability has been recognized for about a decade, its full implications have yet to be realized. Despite how it looks, Postgres isn’t actually doing much work here. The sentence transformer is an external service. The LLM server is also external. All pg_vectorize
does is tie them together with standardized API calls in an extremely common usage pattern.
Remember, all RAG apps essentially do this:
- Transform a user’s question into some rows from the database.
- Feed the result rows into the LLM along with the user’s original question.
- Return the LLM’s response back to the user.
- Repeat
This is such a mundane loop, and the database already handles most of it, so why not move the process into Postgres itself? This way no vectors ever leave Postgres. Nor must we manually loop through the list of results and feed them back into the prompt. If we’re only interested in the answer, that’s the only portion of the response we need.
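Conceptually, what vectorize.rag() now handles inside the database amounts to something like this sketch, where search_similar_chunks and llm_complete are hypothetical stand-ins for the embedding lookup and the model call:
def rag_answer(question, search_similar_chunks, llm_complete):
    # 1. Turn the user's question into relevant rows from the database.
    context_rows = search_similar_chunks(question)

    # 2. Feed those rows to the LLM along with the original question.
    prompt = 'Context:\n' + '\n'.join(context_rows) + '\n\nQuestion: ' + question

    # 3. Return the LLM's response to the user; repeat for each question.
    return llm_complete(prompt)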
It may be that pg_vectorize
has some limitation or implementation detail we want to circumvent in a complex application, and ultimately we may need to return to our previous approach of handling the LLM calls locally. For all other cases, it’s a quick, easy, and usable solution for throwing together a RAG app with almost no other expertise. If you have a table with text columns, you can have a RAG app in minutes.
And it will only accelerate from here!