PG Phriday: A Dirty Postgres RAG

Has it really come to this?

AI is everywhere these days

Postgres and AI go together like elephants and chocolate. At first glance, it seems like a silly combination. Postgres is an RDBMS for storing data with ACID compliance, functions, views, and maybe some extensions or foreign data wrappers. Where is there room for AI in that? It may be trendy to take something, rub some AI on it, and then declare it a breakthrough technology, but that isn’t necessarily reality.

Large Language Models (LLMs) are the current state-of-the-art in the AI realm. Given how they work as one giant blob of data shoved haphazardly into thousands of eye-meltingly expensive GPUs, how could tuple-optimized databases contribute? Well, one aspect of LLMs is that they’re too general-purpose. Training a new one currently requires processing terabytes of source documents for weeks or even months using the aforementioned GPUs. So it’s not uncommon to fine-tune an existing one using a recursive feedback mechanism. But what if there was another way?

Because this article is just chock full of code, direct your attention to the git page for this blog if you want to follow along.

What is RAG?

RAG stands for Retrieval Augmented Generation. Does anyone happen to know what action databases are particularly accomplished at performing? Exactly! Databases are great at storing vast amounts of categorized data. Assuming we apply predicates to pull back specific data sets, we can feed those into the LLM rather than fine-tune one for every possible use case. While the database can’t really host the LLM itself, it can act as middleware, feeding the LLM curated data that increases its response accuracy and relevance.

In the context of Postgres, our pattern looks like this:

  1. Transform a user’s question into some rows from the database.
  2. Feed the result rows into the LLM along with the user’s original question.
  3. Return the LLM’s response back to the user.
  4. Repeat

The current trend is toward natural language processing—which LLMs happen to be very good at. Despite various efforts through Postgres Full Text Search and products like ZomboDB, Postgres isn’t exactly conversational. So how do we rub off that final wart so regular humans can talk to Postgres?

How do LLMs Work?

Well, obviously we need to rub AI on it! You think I’m kidding? One of the cool aspects of how LLMs work is that they’re essentially holographic storage. They’re composed of tokens formed from breaking up words, and multi-dimensional vectors plotting the position of each token in relation to other tokens in long predictive chains. And “multi-dimensional” is putting it lightly; imagine Euclidean Geometry if it accounted for thousands or even tens of thousands of axes.

Maybe don’t imagine that

Every training document influences the position of these vectors ever so slightly, and the interaction chains between them. This is a major reason we have no real idea how they work. We’ve designed a process for building holographic neural networks, and the technology is incredibly powerful while still in its infancy.

How is that relevant to us? The vectors themselves can be applied to basically anything. We’ve mastered the art of transforming text into vectors, and vectors are simple mathematical objects. So why not store them? Remember, databases are astonishingly good at processing sets and applying mathematical formulas. We can make Postgres look like an LLM by exposing the mathematical underpinnings. Sorry ElasticSearch, you’re old news.

This is where extensions like pgvector come in. They bring native vector processing and indexing directly into the Postgres ecosystem, right on the database instance. Simply store a vectorized representation of the interesting data, apply an index or two, and Postgres is now an LLM. Sort of. There’s a critical step that’s missing: actually producing the vectors, also known as embeddings.
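
If you’ve never touched pgvector before, a minimal smoke test looks something like this, assuming the extension is already installed on the server:

-- Enable the extension, then measure the distance between two toy vectors.
CREATE EXTENSION IF NOT EXISTS vector;

SELECT '[1,2,3]'::vector(3) <-> '[2,2,3]'::vector(3) AS l2_distance;
-- Returns 1, the Euclidean (L2) distance between the two points.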

Vector Ingestion

This is where we start to rub the LLM directly on Postgres, and where our project actually begins. We need a real LLM to act as an intermediary to produce some vectors. The neat part about this is that it doesn’t have to be one of the big ones that fills a datacenter with thousands of GPUs, or even a particularly good one. Models like this are commonly called sentence transformers, because they exist purely to transform text into vectors.

They’re actually relatively small, and can be run locally. Consider the all-MiniLM-L6-v2 sentence transformer, for instance. It weighs in at a svelte 90MB and generates minuscule 384-dimension vectors. Unlike speaking with ChatGPT or Grok, we don’t want to incur a doctoral thesis for a simple question; we just need relative similarity to the existing corpus. Essentially we want to know how well a user question compares to data we already have.
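
Loading it and producing an embedding takes almost no ceremony, as this quick sketch shows; the model downloads roughly 90MB on first use and runs fine on a CPU:

from sentence_transformers import SentenceTransformer

# Encode a single sentence and confirm the embedding dimensionality.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
vector = model.encode('How do I use materialized views?')
print(vector.shape)  # (384,)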

With that in mind, consider the following table structure where we store some information about blog posts:

CREATE TABLE blog_articles (
  article_id    INT PRIMARY KEY NOT NULL GENERATED ALWAYS AS IDENTITY,
  author        TEXT,
  title         TEXT,
  content       TEXT,
  publish_date  DATE
);

If we then wanted to store vectors about this data, we’d put it in a separate table because it’s optional metadata. Something like this:

CREATE TABLE blog_article_embeddings (
  embedding_id  INT PRIMARY KEY NOT NULL GENERATED ALWAYS AS IDENTITY,
  article_id    INT NOT NULL REFERENCES blog_articles,
  chunk         TEXT,
  embedding     vector(384)
);

We get access to the vector column type after installing the pgvector extension. One important factor to consider about these tiny LLMs is that their capacity is also small, so we don’t want to feed them more tokens than they can encode in a single pass. Therefore each “chunk” is roughly equivalent to a single paragraph from the original article, as is the associated vector. That inherently ties the context of each paragraph to the underlying vector, which is actually fairly handy.

Why do we keep the chunks? So we can feed them back into the bigger LLM, which can do what Postgres and the transformer can’t. In the interest of full disclosure, we did originally attempt to use a single embedding to represent each article, and it was just as terrible as you might imagine. Result accuracy was laughably low, and it’s no surprise. An article of sufficient length can touch on a dozen tangentially related subjects, and our transformer simply isn’t sophisticated enough to distill that into a single vector. Alas.

A Corpus Amongus

In order to run this test properly, we need a lot of specific text. Some… kind of series of blog articles going back nearly a decade stored in a conveniently parsable format. Barring that, we have access to about 100 Markdown files for this very blog. To be fair, they’re not purely Markdown, but are Hugo format, which is a hybrid of a YAML header and a Markdown body. Either way, these should be trivially easy to parse.

So we wrote a python function to handle that portion:

import yaml

def parse_hugo(filename):
  """Split a Hugo post into its YAML front matter and Markdown body."""
  in_header = False

  with open(filename) as blog_file:
    content = ''
    header = ''

    for line in blog_file.readlines():
      if line.strip() == '---':
        in_header = not in_header
        continue

      if in_header:
        header = header + line
      else:
        content = content + line

    headers = yaml.safe_load(header)

  return {'content': content, **headers}

The function safely separates the YAML and Markdown portions, parses the YAML, and returns a dict of everything it found. The dict should contain the YAML attributes, and a single content section for the article body. It’s a bit rough around the edges, but it’ll get the job done.
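
As a quick sanity check, we can parse a single post and peek at the fields we care about; the path here is just a stand-in for one of the corpus files:

# Hypothetical example: parse one post and inspect the result.
article = parse_hugo('./corpus/some-article.md')
print(article['title'], article['date'], len(article['content']))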

Now that we can parse the files, we need something to transform the text body into one or more vectors. Preferably more, given what we now know about the limitations of relatively small vectors. This is the part where many projects will resort to the “throw money at it” approach.

More than Meets the Eye

For many developers, OpenAI is practically synonymous with AI, and for good reason. They have an extremely convenient API that’s been bolted onto nearly every conceivable cloud service and tool due to their first-mover advantage. But they also—understandably—charge for every invocation of that API. Even at fractions of a penny, trivial actions such as generating a vector embedding can accumulate substantial bills.

Remember that 90MB transformer we mentioned earlier? Why not just use that? Well, we did!

from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", ' ', '.', '```'],
    chunk_size = 500,
    chunk_overlap = 20,
    length_function = len,
    is_separator_regex = False
)

def vectorize(content):
  chunks = splitter.split_text(content)
  embeddings = model.encode(chunks)
  return chunks, embeddings

There are many ways to split up our article text, and the LangChain library provides a lot of convenient methods for doing so. The RecursiveCharacterTextSplitter method we use here allows for overlapping chunks to account for uneven splits and longer paragraphs. More parameter fine-tuning could produce better block consistency, but it’ll work for now.

Now all we have to do is feed the article text into the vectorize function, and it’ll send back easily digestible chunks and the associated vectors. We can also leverage the vectorize function for user queries. We can build a vector for “How do I use materialized views?” and compare it to the vectors for each article, for instance.
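
Here’s a rough illustration of that comparison, using the cosine similarity helper bundled with sentence-transformers; the article path is hypothetical:

from sentence_transformers import util

# Vectorize an article body and a user question, then compare them directly.
chunks, embeddings = vectorize(parse_hugo('./corpus/some-article.md')['content'])
question_vector = vectorize('How do I use materialized views?')[1][0]

# Higher scores mean the chunk is semantically closer to the question.
print(util.cos_sim(question_vector, embeddings[0]).item())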

Now it should be obvious how Postgres is gaining its new natural language abilities.

The Long Slurp

At this point, we just need to dump all of the data into our previously prepared tables. Once again, it’s a Python script to the rescue:

import psycopg2
from glob import glob
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(
  host = 'my-db-host', user = 'postgres',
  password = None, database = 'postgres'
)
register_vector(conn)

cursor = conn.cursor()

for blog in glob('./corpus/*.md'):
  article = parse_hugo(blog)

  cursor.execute(
    """INSERT INTO blog_articles (author, title, content, publish_date)
       VALUES (%s, %s, %s, %s) RETURNING article_id""",
    (article['author'], article['title'], article['content'], article['date'])
  )
  article_id = cursor.fetchone()[0]

  chunks, vectors = vectorize(article['content'])
  for i in range(len(chunks)):
    cursor.execute(
      """INSERT INTO blog_article_embeddings (article_id, chunk, embedding)
         VALUES (%s, %s, %s)""",
      (article_id, chunks[i], vectors[i])
    )

  conn.commit()

Despite how it looks, there really isn’t much going on here. We’re just looping through all of the Hugo Markdown files, parsing them for the sections, and inserting each article. Then after inserting the article, we get the ID it generated and associate that with all of the vector embeddings.

The best part is that we didn’t have to pay a single dime to OpenAI. Out of 101 articles, we generated and stored 2716 vectors, and that’s just the successful ones. We lost track of how many iterations were necessary to tweak the chunking parameters, or to change chunking strategy, or to make up for failed loads, or… Let’s just say the total number of transforms was substantially higher.
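
A couple of quick counts will confirm the totals; naturally, your numbers will vary with the corpus and chunking settings:

-- Verify the ingestion totals mentioned above.
SELECT count(*) AS articles FROM blog_articles;
SELECT count(*) AS chunks FROM blog_article_embeddings;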

Regardless, the data made it to the database intact, we now have suitable vectors which pgvector can use, and we have the necessary scaffolding to locate those vectors using math.

Putting the R in RAG

Now that we have data on our blog articles and associated vectors, we need a good query to retrieve that information. The pgvector extension overloads a few operators specifically for interacting with vectors. Among these is the <-> operator to find the nearest neighbor, which will be fine for our purposes.

Further, we split our data into chunks for another reason. Even within a single article, some sections will be more relevant to a question than others. The vectors allow us to isolate those specific instances and return them to the LLM for processing. But we also want to restrict the number of rows returned because LLMs have limited prompt lengths. If we take a conservative approach, that means we don’t really want to embed more than three references into the prompt.

With that in mind, we designed this query:

  WITH matches AS (
    SELECT embedding_id, article_id
      FROM blog_article_embeddings
     ORDER BY embedding <-> %s
     LIMIT 10),
  weighted_matches AS (
    SELECT embedding_id, article_id, 
           count(*) OVER (PARTITION BY article_id) AS score
      FROM matches
     ORDER BY score DESC
     LIMIT 3)
  SELECT a.title, emb.chunk
    FROM weighted_matches wm
    JOIN blog_article_embeddings emb USING (embedding_id, article_id)
    JOIN blog_articles a USING (article_id);

This CTE does three things, in this order:

  1. Fetch the top 10 matching reference sections.
  2. Weight those by how many matches come from the same article and keep the top 3.
  3. Grab the article title and the text chunk for each reference.

Some might argue that step 2 is unnecessary, and if we were going by strict vector similarity, they would probably be right. But we leaned on the fact that each article has a guiding theme. If a single article shows up 5 times in a vector search, it should probably be the source of most references. This is where design and intent determine how the app will behave, and there’s no real right answer.
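
For comparison, the strict-similarity approach would drop the weighting CTE entirely; a sketch against the same tables might look like this:

  -- Purely distance-ranked retrieval: the three nearest chunks, no article weighting.
  SELECT a.title, emb.chunk
    FROM blog_article_embeddings emb
    JOIN blog_articles a USING (article_id)
   ORDER BY emb.embedding <-> %s
   LIMIT 3;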

We also applied the following index to ensure fast vector searches:

CREATE INDEX idx_article_embedding ON blog_article_embeddings
 USING hnsw (embedding vector_l2_ops);
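
The index also accepts a few tuning knobs if the defaults ever fall short; this alternative form of the same statement simply spells out pgvector’s documented defaults:

-- m controls graph connectivity, ef_construction the build-time effort.
CREATE INDEX idx_article_embedding ON blog_article_embeddings
 USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);

-- At query time, raising ef_search trades some speed for better recall (default 40).
SET hnsw.ef_search = 40;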

Using EXPLAIN on the query shows it uses this index, and searches only require a few milliseconds no matter the user question. That means we can build the first half of any RAG app: retrieval. Consider this Python snippet:

# The question is short enough to produce a single chunk, so grab its one vector.
question_vector = vectorize(question)[1][0]

references = ''
with conn.cursor() as cursor:
  cursor.execute(query, (question_vector,))
  for (title, content) in cursor:
    references += "Reference from {}\n\n{}\n\n".format(title, content)

We’ve built a list of references where each is labeled based on the source article. The LLM can use this information as part of the results even if the original training data contained no such knowledge. Cool, right?

The only thing left is the final piece of glue that brings any RAG application to life: the LLM. The big one with all the brains and billions of parameters.

Some Call Me Tim

This is usually where the OpenAI API appears again. GPT-4 is a valuable service, and having direct programmatic access to the model is impossible to ignore. But does our toy RAG app need that kind of firepower? Do most projects even need it? Probably not. Why not use a different LLM locally?

The best LLM

The Mixtral 8x7B model is currently making waves as one of the highest performing in the industry. It has been shown to outperform 70B parameter models in certain tasks, and some of the more heavily quantized versions can run natively on consumer GPU hardware. Two 24GB NVIDIA RTX 3090 GPUs are enough to run a full copy of the model, and scaled-down versions run mostly OK on a single card.

For the purposes of this experiment, we used a variant of the Mixtral model known as Dolphin 2.7 Mixtral 8x7B, in a GGUF quantization from TheBloke. It’s tuned for coding and tended to produce better results for Postgres questions, and it uses ChatML prompt tags so we can provide a distinct system prompt. This is important because we want our references to guide the LLM when answering user questions.

And this brings us to our final bit of Python:

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path = './models/dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf',
    temperature=0.3,    # keep answers relatively deterministic
    n_gpu_layers=10,    # offload a handful of layers to the GPU
    n_batch=512,
    n_ctx=2048,         # context window shared by the prompt and the response
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=False,      # suppress llama.cpp's own logging
)

template="""<|im_start|>system
You are a PostgreSQL database platform expert tasked with answering difficult user questions. You should defer to the following references in all respects, but are otherwise welcome to fill in gaps with your existing knowledge: 

{refs}

Produce thorough and insightful answers whenever possible, and do not answer any questions unrelated to Postgres.
<|im_end|>

<|im_start|>user
{q}
<|im_end|>

<|im_start|>assistant
"""

llm.invoke(template.format(refs = references, q = question))

That prompt template is a beast, isn’t it? Consider everything we’re actually doing here. We’ve told the LLM that it is the Postgres expert in this situation, and that it has access to several relevant references. We’ve told it to be as helpful as possible, and we’ve sidestepped a common failing in these kinds of solutions by preventing it from straying from its designated purpose as a Postgres expert. Otherwise, a user could ask about birds or home construction and our Postgres expert would happily comply.

Everything else is just our chosen libraries and invocation details. The secret sauce to any LLM interaction really is the prompt, which is why we chose this particular Mixtral variant. Note the definitive prompt sections which leave no ambiguity for the LLM. There are probably still escape paths a creative user could exploit, but we should have a dedicated Postgres expert in all but ridiculous edge cases.

Final Thoughts

Building an AI assistant isn’t a simple process, but I did all of this in a few days, including the time it took to write up the experience and build a small working example on GitHub. The genie is out of the bottle, and these kinds of capabilities will be added to anything even remotely applicable. People already paste full questions into search bars, so why not just give them what they’re naturally asking for?

Postgres is a key part of this process. With tools like pgvector and pg_vectorize for maintaining embeddings with API calls, it’s possible to dramatically increase result accuracy. The tendency of LLMs to hallucinate is a difficult issue related to how imprecise and reconstructive probability-based responses are. By reinforcing the context available to the LLM, it has a more concrete basis for the answers it provides.

Even if we drop the augmented generation portion of RAG, vector-based results are a huge improvement over full-text search capabilities. Rather than looking for mere keywords, we can search by the intent and semantic meaning of a question. Some transformers are better at this than others, but there are hundreds to choose from.

This technology will continue to evolve, and eventually we won’t even need to store the chunks or the vectors as separate entities. Once indexers gain the ability to specify a transformer, the index itself will act as the vector search. It’s an exciting time to be in the database industry, and Postgres is leading the charge thanks to its extensibility.

I, for one, welcome our robot elephant overlords.