PG Phriday: Dealing With Table Bloating

Friday September 11, 2015

Most Postgres operators and informed users are aware that it uses MVCC for storage. One of the main drawbacks of this versioning mechanism is related to tuple reuse. In order to reuse the space, VACUUM must complete a cycle on the table. Unfortunately this isn’t always possible to “optimize” for larger tables. How so?

If a large table needs to have a calculated column added, or some other bulk query updates a large portion of its content, a large fragment of the table is now empty space. This can be painful in a warehouse scenario, but it can be even more disruptive if a table that once fit in memory is now causing endless disk paging on a transaction-heavy system. It can mean the difference between a server being driven into the ground by IO requests, or one that runs smoothly with an order of magnitude more requests.

Take, for instance, a product mapping table:

DROP TABLE IF EXISTS product_map;

CREATE TABLE product_map
(
    map_id        SERIAL  PRIMARY KEY,
    product_id    INT     NOT NULL,
    vendor_id     INT     NOT NULL,
    product_code  TEXT    NOT NULL
);

INSERT INTO product_map (product_id, vendor_id, product_code)
SELECT a.id, b.id, 'P' || abs(100000 - a.id)::TEXT
  FROM generate_series(1, 100000) a(id),
       generate_series(1, 10) b(id);

ANALYZE product_map;

This architecture is fairly common when mapping vendor product codes with internal tracking numbers. The approach inoculates us from vendor dependency or collisions caused by two vendors using the same ID system. These tables don’t tend to be very large, and here is how Postgres sees ours:

SELECT pg_size_pretty(pg_relation_size('product_map'));

 pg_size_pretty 
----------------
 50 MB

In this case, we have a million mappings with ten vendors and a made up string product code. Imagine for a moment that a few vendors changed their code systems, or we made a mistake loading the table. Now we need to update a large portion of the table. To simulate this, here’s an update statement that will modify half of the rows to have different product code values. We’ll also check the size of the table following the update.

UPDATE product_map
   SET product_code = 'D' || abs(50000 - product_id)
 WHERE vendor_id > 5;

SELECT pg_size_pretty(pg_relation_size('product_map'));

 pg_size_pretty 
----------------
 75 MB

This is where our story begins. The table is now 50% larger than before, but contains the same number of rows. This won’t affect index usage, but any query that reads the entire table is in trouble. Not only does this slow down VACUUM, but even something as simple as a daily report would have to work that much harder to summarize the contents of this table. Ad-hoc jobs will also take longer to complete.

This is fine with our measly million row table, but imagine a content map instead of a product map. Every book, song, movie, or piece of media supplied by an outside source could have multiple entries in this table. That inflates our row count into the tens or hundreds of millions. If 50% of that were empty space due to a bulk operation, we could be in trouble.

This kind of thing happens every day. If a company doesn’t have a Postgres DBA, it might take them by surprise when a nicely running system becomes a stuttering monstrosity because they changed some data the previous night. Not only is VACUUM using disk cycles trying to keep up with maintaining a multi-million row table, but the table no longer fits in memory and is causing massive IO thrashing.

The only permanent escape is to fix the table. There are two mainstream ways to accomplish this that are supported by SQL syntax. We can either CLUSTER the table by reorganizing it by one of the indexes, or use VACUUM FULL to rebuild it. Here’s what the table looks like after using CLUSTER on it:

CLUSTER product_map_pkey ON product_map;

SELECT pg_size_pretty(pg_relation_size('product_map'));

 pg_size_pretty 
----------------
 50 MB

There, back to normal! Sadly there are some major caveats to using this approach:

Both methods require an exclusive table lock for the duration of the process.
Both are serial processes.

This is primarily a concern because such an exclusive lock prevents even reading the table until the command completes. If there’s no available maintenance window to accommodate potentially hours of locking a table, that’s a huge complication. That drawback alone could prevent fixing the table in all but the most dire of circumstances. This also makes Postgres look bad to anyone who just wants something that works; MySQL doesn’t have such a caveat after all.

This is all related to how the commands actually function. Here’s a quick summary of what’s going on:

Create a copy of the table as an empty framework.
Copy the active table rows from the old table.
Create each index.
Atomically rename the table to replace the old one.

They do all that work behind the scenes so it resembles a single operation. Most of these steps are either destructive or modify the underlying table structure in some way. So long as the table is in flux, it’s not safe to use it. So, we wait. We wait for the rows to be copied. We wait for the indexes to be recreated based on the new page locations. We wait.

The problem is that the index creation step is serial. If there were six indexes, Postgres would dutifully create each, one after the other. That’s something that could be done in parallel, but not in current versions of Postgres. So if we have a table with one hundred million rows and four indexes, that could be several hours of rebuilding, during which, the table is completely unusable.

Enter pg_repack, a fork of the older pg_reorg project. It boasts the ability to reorganize a table without an exclusive lock. As an added benefit, it can rebuild indexes in parallel, drastically reducing the time spent on this step. Now we have an external tool that not only allows us to do maintenance after bloating a table, but finishes quicker too.

This is particularly relevant with operations that can’t be streamlined. The UPDATE statement we used might be followed up by an INSERT for example. Well, VACUUM doesn’t work in transactions, so we would have to fragment the job to specifically use two separate transactions separated by an explicit VACUUM. Yet the very nature of an RDBMS makes this an undesirable approach. Transactions and atomic operations protect data integrity, yet we have to manually circumvent this in order to avoid unnecessary table bloat.

Having a non-invasive maintenance tool like this is critical to Postgres. So critical that whatever magic it uses should be part of the core engine. MVCC, despite its benefits, makes maintenance an integral part of operating a Postgres database. When performing that maintenance becomes invasive, disruptive, or requires circumventing normal operation, people notice. It’s one of the few remaining hurdles to mass adoption Postgres still faces.

Without a postgres DBA on staff to get a project beyond these issues, the perception becomes that it’s a pain in the neck to use. Unfortunately, Postgres DBAs are in short supply. That is, after all, why I’m writing these articles.

See Also