As a DBA, I strive not to live in an isolated ivory tower, away from the developers who are destined to fill our databases with volumes of data from a myriad of applications. It’s important, I think, to occasionally come back to basics. So I’d like to discuss one of the core elements that PGDB DBAs may be intimately familiar with, but which comes up often enough that some clarification is warranted. How does PostgreSQL store data? The answer may surprise you, and it’s critical to understand before regularly interacting with a PGDB system.

The storage and concurrency model PGDB uses is called Multiversion Concurrency Control, or MVCC for short. There is excellent documentation on the subject, along with further resources that attempt to convey a thorough understanding, and succeed in that regard. I strongly encourage anyone who works with PostgreSQL in any context to read these until they can recite the basics from memory. It is the very bedrock of the database.

But we need an executive summary, so let’s start with an extremely simple table with an id and generic column:

ID    Stuff
1₁    foo
2₁    bar
3₁    baz

Let’s say that some time in the future, we update ID 2. This would be the table at that point:

ID    Stuff
1₁    foo
2₁    bar
3₁    baz
2₁₈   moo

Notice that there are now two copies of the row for ID 2. How is it possible for PGDB to differentiate between them? Which one is “correct”?

To answer those two questions, I’ll need a short aside to explain transaction counters. PGDB keeps a counter that it assigns to all database activities, regardless of origin. Because this number can’t increase endlessly, the counter currently wraps after about two billion values have been used. PGDB then applies an “age” algorithm to produce an abstract representation of data antiquity. For the purposes of this discussion, the subscript in the examples represents the transaction age.
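The age comparison can be sketched as circular arithmetic over a fixed-size counter. This is a toy model, not PostgreSQL’s actual implementation; the `XID_SPACE` constant and `xid_age` function are illustrative names I’ve invented for the sketch:

```python
XID_SPACE = 2**32  # transaction IDs live in a fixed-size, wrapping space

def xid_age(current_xid: int, row_xid: int) -> int:
    """How many transactions ago row_xid was assigned, measured
    circularly so the comparison still works after the counter wraps."""
    return (current_xid - row_xid) % XID_SPACE

print(xid_age(100, 18))            # 82 transactions ago
print(xid_age(5, XID_SPACE - 10))  # 15 -- the counter wrapped in between
```

The circular subtraction is what lets a bounded counter stand in for an ever-increasing one, as long as no two live transaction IDs are ever more than half the space apart.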

Every action gets a transaction ID, and every transaction ID acts like a snapshot because it provides a stable comparison point. Even if you don’t explicitly call BEGIN TRANSACTION, each statement gets a transaction ID, and thus an age to determine visibility.

Now, take note of the subscript for the second row version in the above example. It indicates that the second row was inserted by transaction 18. If transaction 5 were still open for some reason, it would not see this row, and thus the old version is still valid within that context. But transaction 20 would see both, and use the newest version.
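The visibility rule from the example can be sketched as a toy model. The `RowVersion` class and the `visible` and `read` functions are illustrative inventions; real PostgreSQL also tracks a deleting transaction (`xmax`) and commit status, both of which this sketch ignores:

```python
from dataclasses import dataclass

@dataclass
class RowVersion:
    id: int
    stuff: str
    xmin: int  # transaction that created this version

# The table after transaction 18 updates ID 2 (matching the example above)
table = [
    RowVersion(1, "foo", xmin=1),
    RowVersion(2, "bar", xmin=1),
    RowVersion(3, "baz", xmin=1),
    RowVersion(2, "moo", xmin=18),
]

def visible(row: RowVersion, my_xid: int) -> bool:
    """A version is visible only if an older transaction created it."""
    return row.xmin < my_xid

def read(table, my_xid: int, row_id: int):
    """Return the newest version of row_id that my_xid can see."""
    candidates = [r for r in table if r.id == row_id and visible(r, my_xid)]
    return max(candidates, key=lambda r: r.xmin, default=None)

print(read(table, my_xid=5, row_id=2).stuff)   # "bar" -- can't see txn 18's work
print(read(table, my_xid=20, row_id=2).stuff)  # "moo" -- sees both, uses newest
```

Note that both versions coexist in the table; which one a reader gets depends entirely on the reader’s own transaction ID.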

This is how MVCC works, in a nutshell. The primary benefits here are related to blocking. Readers can’t block readers, as they’re not creating new row versions. Further, writers won’t block readers, as the old row version is still valid, and the reader would not see the new row version anyway due to transaction age. And naturally, readers can’t block writers, as reading data doesn’t inherently change its representation. A lot of databases have extremely complex and disruptive rules and locking systems to enforce these outcomes, but it’s just a natural part of MVCC.

Now imagine we update that row again:

ID    Stuff
1₁    foo
2₁    bar
3₁    baz
2₁₈   moo
2₄₂   pop

Now there are three versions of the same row, and another will accumulate for every UPDATE that row endures. As you might imagine, this is extremely undesirable in the long term: a very active table that receives a high volume of updates can quickly bloat out of control. It’s not uncommon for high transaction volumes to render tables 90+% old junk data.

PGDB solves this problem with VACUUM. A vacuum is given a transaction ID like everything else, but it can also do basic visibility testing, because it knows which transactions are still open within the system. It recycles any row version that is older than its own transaction ID and is no longer visible to any open transaction. After being vacuumed, our table looks like this:

ID    Stuff
1₁    foo
2₁    bar   (reclaimed)
3₁    baz
2₁₈   moo   (reclaimed)
2₄₂   pop

Because PGDB wants to reduce potential locking contention, the old row is not actually removed from the data file. Instead, the location is entered into a free space map. Any row in the free space map will be reused by new row versions. Let’s add a new row:

ID    Stuff
1₁    foo
5₉₄   fun
3₁    baz
2₁₈   moo   (reclaimed)
2₄₂   pop

There are two major benefits to this approach:

  1. New row versions don’t necessarily cause the table data files to grow.
  2. New row versions are not always inserted at the end of the table. That area is known as a write “hot zone,” and depending on the underlying file system, it can be a huge source of contention and write delays.
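The slot-reuse behavior behind both benefits can be sketched like this. The `insert` function and the list-based free space map are hypothetical simplifications of PostgreSQL’s on-disk free space map:

```python
def insert(table, free_space_map, new_row):
    """Reuse a reclaimed slot if one exists; otherwise grow the table."""
    if free_space_map:
        slot = free_space_map.pop(0)
        table[slot] = new_row
    else:
        table.append(new_row)
    return table

table = [
    {"id": 1, "stuff": "foo", "xmin": 1},
    None,  # slot reclaimed by an earlier vacuum
    {"id": 3, "stuff": "baz", "xmin": 1},
]
free_space_map = [1]

insert(table, free_space_map, {"id": 5, "stuff": "fun", "xmin": 94})
print(table[1]["stuff"])  # "fun" -- slot 1 reused
print(len(table))         # still 3: the data file did not grow
```

Only once the free space map is exhausted does a new version land at the end of the table, which is exactly why regular vacuums keep files from growing.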

However, this means our maintenance can never fall behind, or we risk filling the file system with old, useless junk data. Table activity becomes inherently slower as bloat increases, adversely impacting performance in several different ways. Even worse, without regular vacuums, the transaction ID wraparound I mentioned earlier isn’t properly handled. Left unchecked, the database will eventually shut itself down, since it can no longer generate transaction IDs that safely version data.

I won’t get into the automatic vacuum system that generally addresses this, but I can say that it’s not a complete solution. All vacuums function by reading table data files and maintaining the visibility map. This activity is necessarily limited by disk performance, and various internal settings that prevent vacuums from consuming all available disk throughput. Due to these limitations, tables that receive a high volume of updates will always be larger than their ideal size—sometimes much larger.

And this is where application developers come in. I ask that you think about how MVCC works, and consider how you develop a program that needs to store data. MVCC does not lend itself well to certain use cases, such as session tracking where each page load causes an update in the underlying table. There are better approaches for storing fungible session states than a persistent row in the database.

Most scenarios are well suited to how PGDB stores data. It’s a very powerful and capable database engine with far fewer locking problems than other systems. But we pay for that flexibility with extra maintenance concerns and considerations for avoiding excessive update volume. Certainly, PGDB will function without specifically addressing these points, but it works better when everyone along the chain knows how to work around its inherent limitations.

Know your tools, and they will rarely disappoint you.

PG Phriday: Let’s Talk About Data Storage

One thought on “PG Phriday: Let’s Talk About Data Storage”

  • Thanks for the nice explanation. For what it’s worth, PostgreSQL does have facilities for data that’s updated with great frequency and without a need to persist across natural disasters, namely unlogged tables. Any chance you might write those up in a future article?
