PG Phriday: Papa's Got a Brand New RAG
AI is all anyone talks about these days, isn’t it? Even when it comes to databases like Postgres, devs are finding new and innovative ways to leverage LLMs in everyday use cases. Can you really blame them though? This is an exciting new technology that will transform how we work and potentially society as a whole once it finally matures. We even covered building a crude RAG app a few short weeks ago.
Technology moves fast though—especially when it comes to AI—and the techniques we use will inevitably evolve accordingly. So let’s rebuild our crude RAG app using a more modern approach that handles much of the work for us!
There’s a lot of code here, so check out the Git repository to follow along more easily.
A Grab Bag of Tools
Like our previous attempt, we’re going to need a whole lot of prerequisite libraries, utilities, and at least one model. What’s neat about this is that acquiring all of these is the hard part.
Let’s go over the basics:
- LM Studio - To easily browse and download LLMs.
- Ollama - To run the LLMs and provide a REST API.
- Docker - Quickly set up and tear down the database and related tools.
- pg_vectorize - To do literally everything else.
Using LM Studio is as simple as launch and search. It can run LLMs on its own, but may not properly leverage GPU resources. The big win here is that it’s easy to find recent popular LLMs that will fit in local GPU RAM and see basic documentation on using them.
Ollama definitely can use local GPUs, and it’s a one-line install to get it running as a local service.
curl -fsSL https://ollama.com/install.sh | sudo sh
There’s really only one modification we need to make to Ollama after it’s installed, so that it binds to all IPs rather than just the default of localhost:
sudo systemctl edit ollama
Then paste these contents:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart Ollama for good measure, and it’s ready to roll!
sudo systemctl restart ollama
It should be running on all IPs on the default port of 11434, ready to serve any models we’ve installed.
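To sanity-check the service, we can hit Ollama’s REST API directly. Here’s a minimal Python sketch using the requests library against the /api/tags endpoint, which lists locally installed models (it will be empty until we import one in the next section):
import requests

# Ollama's REST API lists locally installed models at GET /api/tags.
resp = requests.get('http://localhost:11434/api/tags', timeout=5)
resp.raise_for_status()

# Prints nothing until we import a model below.
for model in resp.json().get('models', []):
    print(model['name'])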
Building the Brains
Assuming we downloaded a model using LM Studio, we need to read the documentation on how that particular LLM must be prompted. For the sake of this demonstration, we chose LLaMa3-Instruct as it’s incredibly recent and quite capable even at a mere 8B parameters. You could run this thing on a modestly-specced laptop.
That said, we still need to import the model into Ollama. Since LLaMa3 is already fairly popular, its prompt template and common parameters are well documented.
Since we’re building a Postgres RAG stack, we can also pre-load a system prompt focused on Postgres. Taking all of that into consideration, we’d want to create a file named llama3.modelfile with these contents:
FROM Meta-Llama-3-8B-Instruct.Q8_0.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
SYSTEM """You are a PostgreSQL expert who always answers user questions with concise answers about Postgres."""
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
PARAMETER num_keep 24
Then we can import the model like this:
ollama create llama3 -f llama3.modelfile
Now we can use the model any time we want. Let’s test it for good measure:
ollama run llama3
>>> Please give one short sentence that best describes Postgres.
PostgreSQL is an open-source relational database management system known for its reliability, data integrity, and scalability.
Great!
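The model is also reachable over Ollama’s REST API, which is how pg_vectorize will talk to it later. As a quick illustration, here’s a small Python sketch against the /api/generate endpoint with streaming disabled, so the whole answer comes back as one JSON document:
import requests

# Send a single prompt to the imported llama3 model and print the reply.
resp = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3',
        'prompt': 'Please give one short sentence that best describes Postgres.',
        'stream': False,   # return one JSON document instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()['response'])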
Chunky Style
In our previous foray, we built a content importer that calculated text embeddings based on content chunks during the import process. This is a great way to use a local sentence transformer rather than a service like OpenAI, but we’re going to be using pg_vectorize
for this, so we can remove all of that code.
That means our tables look more like this:
CREATE TABLE blog_articles (
article_id BIGINT PRIMARY KEY NOT NULL GENERATED ALWAYS AS IDENTITY,
author TEXT,
title TEXT,
content TEXT,
publish_date DATE,
last_updated TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE blog_article_chunks (
chunk_id BIGINT PRIMARY KEY NOT NULL GENERATED ALWAYS AS IDENTITY,
article_id BIGINT NOT NULL REFERENCES blog_articles,
chunk TEXT,
last_updated TIMESTAMPTZ NOT NULL DEFAULT now()
);
Rather than a table named blog_article_embeddings
, we now track the chunks separately. At least for now, LLMs still often have problems with context length, and 8k seems to be the most common limitation. Even a single blog article is likely to cross this boundary, so it’s best to still operate with more restricted extracts instead. As the technology matures, this limitation is likely to be lifted, but for now, we need to account for it.
The Short Slurp
The Python code is also dramatically simplified without all of the overhead of generating embeddings during the import process. Our new importer looks like this:
import rag
import config
import psycopg2
from glob import glob
import_path = './corpus'
conn = psycopg2.connect(
    host = 'my-db-host', port = 5432, user = 'postgres',
    password = None, database = 'postgres'
)

cursor = conn.cursor()
chunker = rag.Splitter()

for blog in glob(import_path + '/*.md'):
    article = rag.parse_hugo(blog)
    print("Importing " + article['title'], flush=True)

    cursor.execute(
        """INSERT INTO blog_articles (author, title, content, publish_date)
           VALUES (%s, %s, %s, %s) RETURNING article_id""",
        (article['author'], article['title'], article['content'], article['date'])
    )

    article_id = cursor.fetchone()[0]
    chunks = chunker.chunk_content(article['content'])

    for i in range(len(chunks)):
        cursor.execute(
            """INSERT INTO blog_article_chunks (article_id, chunk)
               VALUES (%s, %s)""",
            (article_id, chunks[i])
        )

conn.commit()
That may not seem like much of a change, and it isn’t. But take a closer look at the chunk_content
method stripped of its class wrappings:
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", ' ', '.', '```'],
    chunk_size = 500,
    chunk_overlap = 20,
    length_function = len,
    is_separator_regex = False
)

def chunk_content(content):
    chunks = splitter.split_text(content)
    return chunks
The transformer is gone, the vector embeddings are gone, and all we’re left with is a reusable splitter definition and a function call. As a result, importing is much faster than before.
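As a quick illustration of what the splitter produces, here’s a short usage sketch with a made-up Markdown string (the sample text is purely illustrative):
# Chunk a sample Markdown string and inspect the resulting pieces.
sample = '## Upserts in Postgres\n\n' + ('Use INSERT ... ON CONFLICT to merge rows. ' * 40)

chunks = chunk_content(sample)
print(len(chunks), 'chunks')

for chunk in chunks:
    print(len(chunk), chunk[:50])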
Starting Services
This is the part where we use Docker to run all the stuff we’re too lazy to install locally. We only really need two things: a Postgres instance, and an embedding server. Since we need to run these two things in tandem, it makes sense to use docker compose
. Tembo likes to boast about their extension stacks, one of which is based on AI use cases. They’ve been kind enough to maintain images for demonstration and development purposes, and this includes an image that provides all the prerequisites for pg_vectorize
and an image which can act as an embedding server.
This simple docker-compose.yml
will launch both of these containers, and notice the extra_hosts
line where we use host.docker.internal
. This will allow the docker containers to reach back into the host system where our instance of Ollama is running. We’ll need this for later!
services:
  postgres:
    container_name: rag-postgres
    image: quay.io/tembo/vectorize-pg:latest
    ports:
      - 5432:5432
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped
  vector-serve:
    container_name: rag-vector-serve
    image: quay.io/tembo/vector-serve:latest
    ports:
      - 3000:3000
    restart: unless-stopped
Then we can create and start the containers, getting everything ready for our RAG app.
docker compose create
docker compose start
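Before moving on, it doesn’t hurt to verify the Postgres container is accepting connections. A minimal psycopg2 check (the compose file above maps port 5432 and uses trust auth, so no password is needed):
import psycopg2

# Connect to the containerized Postgres instance and confirm it responds.
conn = psycopg2.connect(host='localhost', port=5432, user='postgres', dbname='postgres')

with conn.cursor() as cursor:
    cursor.execute('SELECT version();')
    print(cursor.fetchone()[0])

conn.close()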
At this point, we can connect to our database and initialize the vectorize
extension which will be doing all the heavy lifting from now on.
psql -h localhost -U postgres -d postgres
postgres=# CREATE EXTENSION vectorize CASCADE;
NOTICE: installing required extension "pg_cron"
NOTICE: installing required extension "pgmq"
NOTICE: installing required extension "vector"
CREATE EXTENSION
Time: 76.205 ms
Why a Docker image for Postgres? Installing pg_vectorize currently requires several steps, as the extension is still relatively new. There are not yet Debian or RPM packages for easy distribution, so the fastest way to play with it is to use a pre-built image.
What’s Your Vector, Victor?
Despite purging embeddings from our content import process, we still need embeddings! Thankfully the pg_vectorize
extension provides the ability to maintain embeddings on our behalf. As long as we’re using version 0.6.0 or higher, we can specify any OpenAI-compatible service rather than use an OpenAI account.
Given what we know about our Docker environment, we need to set a parameter for pg_vectorize
to recognize our custom server:
ALTER SYSTEM SET vectorize.embedding_service_url TO 'http://rag-vector-serve:3000/v1/embeddings';
SELECT pg_reload_conf();
We should also note that the vector-serve
image comes bootstrapped with the all-MiniLM-L12-v2
sentence transformer. It can download others to cache locally, but if we use the one it comes with, we don’t need to do any extra work.
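We can also poke the embedding server directly from the host to confirm the model is loaded. This sketch assumes the standard OpenAI-style request body and that the bundled model is addressable as all-MiniLM-L12-v2; adjust the identifier if your image expects the full sentence-transformers/ prefix:
import requests

# Request an embedding from the vector-serve container on port 3000.
resp = requests.post(
    'http://localhost:3000/v1/embeddings',
    json={
        'input': ['How do I merge rows in Postgres?'],
        'model': 'all-MiniLM-L12-v2',   # assumed identifier; may need the full prefix
    },
    timeout=30,
)
resp.raise_for_status()

embedding = resp.json()['data'][0]['embedding']
print(len(embedding))   # all-MiniLM-L12-v2 produces 384-dimensional vectors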
Assuming we’ve imported our Hugo Markdown corpus using import.py
, our blog_articles
and blog_article_chunks
tables should have sufficient content to get started. We only need a single SQL command to create the embeddings:
SELECT vectorize.table(
job_name => 'blog_article_chunks_rt',
"table" => 'blog_article_chunks',
primary_key => 'chunk_id',
update_col => 'last_updated',
columns => ARRAY['chunk'],
transformer => 'sentence-transformers/all-MiniLM-L12-v2',
schedule => 'realtime'
);
INFO: creating queue;
table
--------------------------------------------------
Successfully created job: blog_article_chunks_rt
The job name is how pg_vectorize
keeps track of tables that need regular updates to embeddings. By setting the schedule
to realtime
, it attaches a trigger to the blog_article_chunks
table so any update to chunk content will also update the embedding. One of the reasons pg_vectorize
requires the pgmq
extension is so this procedure is asynchronous. Our command launches the job in the background and we get our psql prompt back immediately, while the embeddings get generated.
Once the job completes, we should see a few new tables in the vectorize
schema:
SELECT tablename
FROM pg_tables
WHERE schemaname = 'vectorize'
AND tablename LIKE '_embed%';
tablename
------------------------------------
_embeddings_blog_article_chunks_rt
(1 row)
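Although our end goal is full RAG, the embeddings are already usable on their own. pg_vectorize also provides a vectorize.search() function for plain similarity queries against a job like ours. Roughly, from Python (a sketch; consult the pg_vectorize docs for the exact signature):
import psycopg2

# Ask the 'blog_article_chunks_rt' job for the three chunks most similar
# to a natural-language query; results come back as JSON.
conn = psycopg2.connect(host='localhost', port=5432, user='postgres', dbname='postgres')

with conn.cursor() as cursor:
    cursor.execute(
        """SELECT vectorize.search(
               job_name => 'blog_article_chunks_rt',
               query => 'merging rows and upserts',
               return_columns => ARRAY['chunk_id', 'chunk'],
               num_results => 3
           );"""
    )
    for row in cursor.fetchall():
        print(row[0])

conn.close()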
From this point onward, any new blogs we insert, or any we modify, will automatically get new embeddings without any extra work from us. But we’re building a RAG app here, not just a system to maintain embeddings. So rather than using vectorize.table
, let’s leverage a different tactic!
Enabling RAG
With the release of v0.15, the pg_vectorize
extension gained the ability to send queries to LLMs other than OpenAI. This is the real reason we set up our own Ollama server. As with sentence transformers, we can specify a configuration parameter to control how RAG requests are resolved:
ALTER SYSTEM SET vectorize.ollama_service_url TO 'http://host.docker.internal:11434';
SELECT pg_reload_conf();
If we were using some dedicated GPU server or cloud service, we would specify that URL rather than the docker host. Regardless, once that parameter is set, we simply need to initialize the RAG functionality of pg_vectorize
:
SELECT vectorize.init_rag(
agent_name => 'blog_chat',
table_name => 'blog_article_chunks',
"column" => 'chunk',
unique_record_id => 'chunk_id',
transformer => 'sentence-transformers/all-MiniLM-L12-v2',
schedule => 'realtime'
);
init_rag
-------------------------------------
Successfully created job: blog_chat
Note how similar this is to the vectorize.table
approach. By telling pg_vectorize
we want to build a RAG app, it knows we need embeddings as well, so it will maintain them in the same way as if we’d simply performed that step separately.
Believe it or not, we now have a fully functional RAG app, purely through SQL. Observe:
SELECT vectorize.rag(
agent_name => 'blog_chat',
query => 'How can I handle row merge or upserts with Postgres?',
chat_model => 'ollama/llama3'
) -> 'chat_response';
"Use a single SQL statement with the `INSERT ... ON CONFLICT` syntax. This allows you to perform an insert operation that will either create a new row or update an existing one, depending on whether the row already exists."
Why does the answer not cover the actual Postgres MERGE
syntax? Because this blog corpus doesn’t have an article on that topic! An 8B parameter model is unlikely to have very deep knowledge about Postgres v15 and above, so most of what it does know comes from our blog excerpts.
For more information on how pg_vectorize
handles sentence transformers and models, check out the official documentation.
Easiest RAG App Ever
So how does this affect our final Python RAG app? See for yourself:
import sys
import psycopg2

question = sys.argv[1]

query = """
    SELECT vectorize.rag(
        agent_name => 'blog_chat',
        query => %s,
        chat_model => 'ollama/llama3'
    ) -> 'chat_response';
"""

conn = psycopg2.connect(
    host = 'my-db-host', port = 5432, user = 'postgres',
    password = None, database = 'postgres'
)

print("Sending your question and references to the LLM...", flush=True)

with conn.cursor() as cursor:
    cursor.execute(query, (question,))
    answer = cursor.fetchone()[0]

conn.close()

print("Response:\n\n", flush=True)
print(answer)
Well that was anticlimactic! Our once mighty RAG app which directly wielded the cosmic power of an LLM has been reduced to a mere query wrapper. But does it work?
python question-rag.py "Is Postgres the best database engine?"
Four times since 2017, it has won the DB-Engines "DBMS of the Year" award.
Well said!
Final Thoughts
Postgres extensions make it possible for the database to act as a kind of middleware. While this capability has been recognized for about a decade, its full implications have yet to be realized. Despite how it looks, Postgres isn’t actually doing much work here. The sentence transformer is an external service. The LLM server is also external. All pg_vectorize
does is tie them together with standardized API calls in an extremely common usage pattern.
Remember, all RAG apps essentially do this:
- Transform a user’s question into some rows from the database.
- Feed the result rows into the LLM along with the user’s original question.
- Return the LLM’s response back to the user.
- Repeat
This is such a mundane loop, and the database already handles most of it, so why not move the process into Postgres itself? This way no vectors ever leave Postgres. Nor must we manually loop through the list of results and feed them back into the prompt. If we’re only interested in the answer, that’s the only portion of the response we need.
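Conceptually, what vectorize.rag() now handles inside the database amounts to something like this sketch, where search_similar_chunks and llm_complete are hypothetical stand-ins for the embedding lookup and the model call:
def rag_answer(question, search_similar_chunks, llm_complete):
    # 1. Turn the user's question into relevant rows from the database.
    context_rows = search_similar_chunks(question)

    # 2. Feed those rows to the LLM along with the original question.
    prompt = 'Context:\n' + '\n'.join(context_rows) + '\n\nQuestion: ' + question

    # 3. Return the LLM's response to the user; repeat for each question.
    return llm_complete(prompt)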
It may be that pg_vectorize
has some limitation or implementation detail we want to circumvent in a complex application, and ultimately we may need to return to our previous approach of handling the LLM calls locally. For all other cases, it’s a quick, easy, and usable solution for throwing together a RAG app with almost no other expertise. If you have a table with text columns, you can have a RAG app in minutes.
And it will only accelerate from here!