PG Phriday: Partitioning Candidates

What’s a good table to partition? It’s not always a question with an obvious answer. Most often, size and volume determine whether or not a table should be broken into several chunks. However, there are also cases where business or architecture considerations might use partitioning to preserve a shared table structure, or drive aggregate analysis over a common core. In Postgres (PostgreSQL), this is even more relevant because of how partitioning is handled through inheritance.

For now, let’s focus on elements that can be directly quantified, such as size and row count. Luckily, the Postgres system catalog contains this information, so let’s see what it has to say about the sensor_log table introduced in last week’s article on The Case for Partitioning.

SELECT oid::regclass::text AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       ceil(reltuples / 1000000) AS million_rows
  FROM pg_class
 WHERE relkind = 'r'
   AND reltuples > 10000000
 ORDER BY relpages DESC;

 table_name | total_size | million_rows 
------------+------------+--------------
 sensor_log | 12 GB      |          100

With the help of a couple administration functions, we can focus on any tables that are immediate outliers. In this case, the only criterion that really matters is that we want to see any tables with more than 10 million rows. Why that particular number? Tables that cross this threshold often continue to grow relatively quickly due to a consistent data stream. This volume tends to increase over time, and large tables face a number of problems.

Consider the sensor_log table as a basis for analysis. At 12GB, it probably fits in memory fairly easily. In addition, it’s extremely narrow, having only four columns. Yet even with all of this in our favor, anything that needs to read the entire table will encounter occasional issues. Specifically:

  • Rebuilding a single index takes 2-5 minutes on a completely idle server with no memory pressure. Change any of these variables and the rebuild time can increase dramatically.
  • Directly vacuuming takes 1-2 minutes because it uses all available IO. Auto-vacuum, however, halts frequently to prevent interfering with normal operations and can persist for hours depending on activity. Postgres tables must be vacuumed to reclaim dead rows because of MVCC, and those reusable rows are not available until the vacuum is complete. As the table size increases, so does the time required to vacuum it, and thus the chance of table or index bloat, since rows are not recycled as frequently.
  • Large bulk inserts, updates, or deletes exacerbate the above concerns since they are likely to trigger auto-vacuum or auto-analyze.
  • Risk of process interruption increases. Dumping or importing larger tables is generally safe, yet the costs of canceling, resuming, or retrying are significantly higher.

I’ll illustrate some of these points with a couple of anecdotes.

I frequently work with ridiculously over-provisioned systems. One of these had storage built entirely on PCIe NAND and over 128GB of RAM. Even so, we had a single 500GB table with over 250 million extremely wide rows based on interactions with an upstream service exchange. Occasionally we would run maintenance where large chunks of the table would be exported to disk before being archived and removed. After that process completed, it wasn’t uncommon to dump the table, re-import it, and then rebuild the indexes.

Even after reducing the table by 50%, simply rebuilding the indexes required at least an hour for each one. For one particular composite index on a couple of VARCHAR columns, it was closer to two hours. That was literally as fast as the machine could perform the task, despite several extreme steps we took to reduce these delays and the related downtime:

  • We increased maintenance_work_mem to 1GB to match the Postgres base file size.
  • We pointed base/pgsql_tmp at /dev/shm, a shared-memory RAM drive.
  • We wrote a custom process to rebuild the indexes in parallel.
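In rough terms, the SQL-facing portion of those steps amounted to something like the sketch below; the index names are invented for illustration, and relocating base/pgsql_tmp to the RAM drive happened at the filesystem level rather than in SQL.

-- Give index builds sort memory matching a single 1GB Postgres base file.
SET maintenance_work_mem = '1GB';

-- Rebuild each index; in practice, several of these ran at once from
-- separate sessions. These index names are purely hypothetical.
REINDEX INDEX exchange_log_account_idx;
REINDEX INDEX exchange_log_sent_received_idx;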

None of this was enough. After including verification steps, we could not reduce the outage window below two hours, all because a single table acted as a single entity composed of a vast amount of data. Physics has its limitations, and processing 250 million of anything requires vast resources. If the table had been partitioned by month, we could have made better use of the idle processors and reduced the downtime to a mere twenty minutes.

Another more recent instance illustrates the risks associated with bulk loading large structures. As is common with bulk data, a large volume is imported, and to prevent build-up, older data is sent to an archive server and deleted locally. The problem in this second example is sheer volume: every single day consists of over 100 million data points.

The loading process itself is extremely efficient, making use of COPY for direct imports from the standard input stream, parallel threads, small batches, the works. Unfortunately, the tables were originally designed for a tenth of that volume. Thus, once the archival job starts, the real pain begins. Isolating the old rows is extremely slow. Removing them with DELETE is even slower. Then vacuuming a 500GB-1TB table continues to harass the storage subsystem.
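To make that concrete, the archival side boils down to something like this; sensor_log stands in for the real tables here, reading_date is the assumed timestamp column, and the 90-day cutoff is arbitrary.

-- Find and remove everything past the retention window. On a
-- 500GB-1TB table, the DELETE grinds for hours, and the follow-up
-- vacuum keeps punishing the storage long after the DELETE finishes.
DELETE FROM sensor_log
 WHERE reading_date < CURRENT_DATE - INTERVAL '90 days';

VACUUM ANALYZE sensor_log;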

What could we do instead? Partition. Then the archive process would look like this:

  1. Find oldest partition.
  2. Obtain the contents and send them to the archive.
  3. Drop the partition.

No slow DELETE. No VACUUM. And instead of a SELECT statement to retrieve the rows, we can just use COPY to consume the entire partition. Revising the table architecture in this case prevents hours of needless disk churn and the associated risk of delaying dependent jobs.
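A minimal sketch of that flow, assuming monthly child partitions with invented names, might look like this:

-- Stream the oldest partition to the archive process in one pass.
COPY sensor_log_part_2014_09 TO STDOUT;

-- Once the archive confirms receipt, the old data simply vanishes:
-- no DELETE, no VACUUM, no bloat.
DROP TABLE sensor_log_part_2014_09;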

One of the major benefits of Postgres partitioning is constraint exclusion. This allows the query planner to only consider partitions whose constraints match the query’s WHERE clause. This approach can vastly improve query performance in some cases, especially where sequential scans are concerned. Instead of scanning hundreds of millions of rows, the database might only need to touch some fraction of that amount. There’s no other way to have a “partial” sequential scan.
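To illustrate, assuming sensor_log were split into monthly children with CHECK constraints on reading_date, a query like the one below would only touch the matching child; again, the partition name in the comment is hypothetical.

-- With constraint_exclusion at its default of 'partition', the planner
-- skips any child whose CHECK constraint can't satisfy the WHERE clause.
EXPLAIN
SELECT count(*)
  FROM sensor_log
 WHERE reading_date >= '2014-09-01'
   AND reading_date < '2014-10-01';

-- The resulting plan should scan only sensor_log_part_2014_09 and the
-- (empty) parent table, rather than every partition.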

But partitioning is so much more versatile than simply reducing query costs. As above, maintenance considerations can play a major role in determining partitioning candidates. Even if a table simply accumulates rows day after day, and no process ever updates or deletes anything for the duration of its existence, partitioning is still a good idea. Why?

Statistics. In order to make informed decisions regarding indexes, looping, and so on, PostgreSQL builds statistics for tables during the ANALYZE process. The default_statistics_target setting has a maximum value of 10,000. This means that no matter the size of the table, the database will never sample and calculate more than 10,000 values per column. Because of the costs involved with tabulating these statistics, I rarely set this over 1000, and query plans are generally fine with at least 100 samples.
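For reference, the sampling target can be checked globally or raised for an individual hot column, along these lines, with reading_date as the assumed filter column:

-- The global sample target (100 unless overridden).
SHOW default_statistics_target;

-- Raise the target for one heavily filtered column and refresh stats.
ALTER TABLE sensor_log ALTER COLUMN reading_date SET STATISTICS 1000;
ANALYZE sensor_log;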

For large tables, however, this becomes more important. With only 1000 samples potentially representing billions of rows in a single table, the representative capability of that data becomes diluted and inconsistent. If a large table is loaded during the day, for instance, the data for the current day is vastly outweighed by past data. This can cause row estimates for queries that only need current data to come in far too low, so the database picks a nested loop because it thinks there are only ten results when there are really two million.

We’ve covered nested loops and how they can be problematic in the past. Looping through two million unexpected random reads is a worst-case scenario, but it happens every day due to inadequate statistics. If, however, current data were isolated to a more relevant partition, this scenario would be much less likely. Combined with constraint exclusion, query execution times can be reduced from several hours to a few milliseconds simply due to planning improvements.

And yet businesses continue to suffer with monolithic tables containing billions of rows, wondering why everything is so slow. They complain that it takes hours, or even days, to rebuild certain tables due to excessive turnover causing runaway bloat. They throw faster disks and more memory at the problem, confused at why that didn’t help.

It doesn’t have to be that way, if we use the tools we’re given.