I always advocate breaking up large Postgres tables for a few reasons. Beyond query performance concerns, maintaining one monolithic structure is always more time consuming and consequentially more dangerous. The time required to create a dozen small indexes may be slightly longer than a single larger one, but we can treat the smaller indexes as incremental. If we want to rebuild, add more indexes, or fix any corruption, why advocate an all-or-nothing proposition? Deleting from one large table will be positively glacial compared to simply dropping an entire expired partition. The list just goes on and on.

On the other hand, partitioning in Postgres can be pretty intimidating. There are so many manual steps involved, that it’s easy to just kick the can down the road and tackle the problem later, or not at all. Extensions like the excellent pg_partman remove much of the pain involved in wrangling an army of partitions, and we strongly suggest using some kind of tool-kit instead of reinventing the wheel.

The main limitation with most existing partition management libraries is that they never deviate from the examples listed in the Postgres documentation. It’s always: create inherited tables, add redirection triggers, automate, rinse, repeat. In most cases, this is exactly the right approach. Unfortunately triggers are slow, and especially in an OLTP context, this can introduce sufficient overhead that partitions are avoided entirely.

Well, there is another way to do partitioning that’s almost never mentioned. The idea is to actually utilize the base table as a storage target, and in lieu of triggers, schedule data movement during low-volume time periods. The primary benefit to this is that there’s no more trigger overhead. It also means we can poll the base table itself for recent data with the ONLY clause. This is a massive win for extremely active tables, and the reason tab_tier was born.

Let’s create some data for testing this out:

CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       CURRENT_DATE - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
ANALYZE sensor_log;

Now we have 5-million rows in a table with a defined date column that’s a perfect candidate for partitioning. The way this data is currently distributed, we have content going back to late 2014. Imagine in this scenario we don’t need this much live information at all times. So we decide to keep one week of logs for active use, and relegate everything else into some kind of monthly partition.

This is how all of that would look in tab_tier:

SELECT tab_tier.register_tier_root('public', 'sensor_log', 'reading_date');
UPDATE tab_tier.tier_root
   SET root_retain = '1 week'::INTERVAL,
       part_period = '1 month'::INTERVAL
 WHERE root_schema = 'public'
   AND root_table = 'sensor_log';
SELECT tab_tier.bootstrap_tier_parts('public', 'sensor_log');
                 List OF relations
 Schema |          Name          | TYPE  |  Owner   
 public | sensor_log             | TABLE | postgres
 public | sensor_log_part_201410 | TABLE | postgres
 public | sensor_log_part_201411 | TABLE | postgres
 public | sensor_log_part_201412 | TABLE | postgres
 public | sensor_log_part_201501 | TABLE | postgres
 public | sensor_log_part_201502 | TABLE | postgres
 public | sensor_log_part_201503 | TABLE | postgres
 public | sensor_log_part_201504 | TABLE | postgres
 public | sensor_log_part_201505 | TABLE | postgres
 public | sensor_log_part_201506 | TABLE | postgres
 public | sensor_log_part_201507 | TABLE | postgres
 public | sensor_log_part_201508 | TABLE | postgres
 public | sensor_log_part_201509 | TABLE | postgres
 public | sensor_log_part_201510 | TABLE | postgres
 public | sensor_log_part_201511 | TABLE | postgres
 public | sensor_log_part_201512 | TABLE | postgres
 public | sensor_log_part_201601 | TABLE | postgres
 public | sensor_log_part_201602 | TABLE | postgres
 public | sensor_log_part_201603 | TABLE | postgres
 public | sensor_log_part_201604 | TABLE | postgres
 public | sensor_log_part_201605 | TABLE | postgres

Taking this piece by piece, the first thing we did after creating the extension itself, was to call the register_tier_root function. This officially tells tab_tier about the table, and creates a record with configuration elements we can tweak. And that’s exactly what we do by setting the primary retention window and the partition size. Creating all of the partitions manually is pointless, so we also invoke bootstrap_tier_parts. Its job is to check the range of dates currently represented in the table, and create all of the partitions necessary to store it.

What did not happen here, is any data movement. This goes back to our original concern regarding maintenance. Some tables may be several GB or even TB in size, and moving all of that data as one gargantuan operation would be a really bad idea. Instead, tab_tier provides the migrate_tier_data function to relocate data for a specific partition.

With a bit of clever SQL, we can even generate a script for it:

  SELECT 'SELECT tab_tier.migrate_tier_data(''public'', ''sensor_log'', ''' || 
         REPLACE(part_table, 'sensor_log_part_', '') || ''');' AS part_name
    FROM tab_tier.tier_part
    JOIN tab_tier.tier_root USING (tier_root_id)
   WHERE root_schema = 'public'
     AND root_table = 'sensor_log'
   ORDER BY part_table
) TO '/tmp/move_parts.sql';
\i /tmp/move_parts.sql
SELECT COUNT(*) FROM sensor_log_part_201504;

Following some debugging notices, all of our data has moved to the appropriate partition. We verified that by checking the base table and a randomly chosen partition for record counts. At this point, the table is now ready for regular maintenance. In this case “maintenance” means regularly calling the cap_tier_partitions and migrate_all_tiers functions. The first ensures target partitions always exist, and the second moves any pending data to a waiting partition for all tables we’ve registered.

And that’s it. We’re completely done with this table. If we stopped here, we could be secure in the knowledge we no longer have to worry about some gigantic monolith ruining our day some time in the future. But that’s not how tab_tier got its name. One or two levels does not a tier make; the real “secret sauce” is its support for long term storage.

One thing we didn’t really cover, and most partition systems never even consider, is that partitioning is only half of the story. On an extremely active system, having months or years of data just sitting around is relatively frowned upon. The mere presence of older data might encourage using it, transforming our finely tuned OLTP engine into a mixed workload wreck. One or two queries against those archives, and suddenly our cache is tainted and everything is considerably slower.

We need to move that data off of the system, and there are quite a few ways to do that. Some might use ETL scripts or systems like talend to accomplish that goal. Or we can just use tab_tier and a Postgres foreign table. Let’s now dictate that only six months of archives should ever exist on the primary server. Given that constraint, this is how we could proceed:

-- Do this on some kind of archive server
CREATE USER arc_user PASSWORD 'PasswordsAreLame';
CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL,
GRANT ALL ON sensor_log TO arc_user;
-- Back on the data source..,
UPDATE tab_tier.tier_root
   SET lts_threshold = '6 months'::INTERVAL,
       lts_target = 'public.sensor_log_archive'
 WHERE root_schema = 'public'
   AND root_table = 'sensor_log';
CREATE EXTENSION postgres_fdw;
CREATE USER arc_user PASSWORD 'PasswordsAreLame';
GRANT tab_tier_role TO arc_user;
  FOREIGN DATA WRAPPER postgres_fdw 
  OPTIONS (dbname 'postgres', host 'archive-host');
  SERVER arc_srv 
  OPTIONS (USER 'arc_user', password 'PasswordsAreLame');
CREATE FOREIGN TABLE sensor_log_archive (
  id            INT,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL,
) SERVER arc_srv OPTIONS (TABLE_NAME 'sensor_log');
GRANT INSERT ON sensor_log_archive TO tab_tier_role;
-- Connect as arc_user, then run this:
SELECT tab_tier.archive_tier('public', 'sensor_log');
SELECT COUNT(*) FROM sensor_log_archive;

Whew! That was a lot of work. Maybe a future version of tab_tier should provide a wrapper for that. In any case, all we did was set up a foreign table on a remote server, create a separate user to handle the data movement, and tell tab_tier about our six month threshold for long term storage, and the target table itself.

Using a foreign table isn’t required here, since the target can be any kind of table, but isn’t that the whole point of this exercise? The cool thing about Postgres foreign data wrappers is that we could have used any of them. In this case we’re just moving data to another remote Postgres instance, but we could have dumped everything into Cassandra or Hadoop instead. Take that, subspace!

For those who noticed all of the ridiculous GRANT statements, please remember this is only for demonstration purposes. A real system would probably use ALTER DEFAULT PRIVILEGES to give tab_tier_role more limited control over a specific schema and tables specifically designed for archival. The extension doesn’t add its own privileges—even to tables it creates—in case controls are tightly locked down. We don’t want to hijack any carefully laid down security. Instead tab_tier just propagates any ACLs it finds on root tables to new partitions.

This is the same reason we ran the archive_tier (or archive_all_tiers) routine as a different user. Since we’re using a foreign user mapping, we want to limit data leak potential by isolating the movement process from the table owner or a superuser. We recommend using this approach for any foreign table usage whenever possible.

With all of that out of the way, we still need to clean up. We archived all of the partition content, but the partitions themselves are still sitting around and gathering dust. Let’s fix that by running one final step as the owner of sensor_log or any superuser:

SELECT part_table
  FROM tab_tier.tier_part
 WHERE is_archived;
SELECT tab_tier.drop_archived_tiers();
SELECT COUNT(*) FROM sensor_log_archive;

During the archival process itself, tab_tier marks the related metadata so archived tables will no longer be used in any of the data movement functions. It also makes them an easy target for removal with a maintenance function. We can see that everything worked as a large portion of our data is no longer part of the sensor_log inheritance tree. Now the archived data is securely located on another system that’s probably geared more toward OLAP use, or some incomprehensible Hive we don’t have to worry about.

I for one, welcome our incomprehensible Hive overlords.

PG Phriday: Trusty Table Tiers
Tagged on:             

8 thoughts on “PG Phriday: Trusty Table Tiers

  • Hello , very interesting and powerful new approach to partition the data .. but I have a doubt , following his example , perform the following test:

    — insert information with future-dated

    INSERT INTO sensor_log (id, location, reading, reading_date) SELECT s.id+5000000, s.id % 1000, s.id % 100, CURRENT_DATE + ((s.id * 10) || ‘s’)::INTERVAL FROM generate_series(1, 1000000) s(id) order by random() limit 300;

    SELECT COUNT() FROM sensor_log; — 6 millones SELECT COUNT() FROM only sensor_log; — 1060480

    select tab_tier.cap_tier_partitions(); select tab_tier.migrate_all_tiers();

    SELECT COUNT() FROM sensor_log; — 6 millones SELECT COUNT() FROM only sensor_log; — 1060480 ????

    why do not generate additional partitions ??

    1. excuse me, the script:

      INSERT INTO sensor_log (id, location, reading, reading_date) SELECT s.id+5000000, s.id % 1000, s.id % 100, CURRENT_DATE + ((s.id * 10) || ‘s’)::INTERVAL FROM generate_series(1, 1000000) s(id);

      SELECT COUNT() FROM sensor_log; — 6 millones SELECT COUNT() FROM only sensor_log; — 1060480

      select tab_tier.cap_tier_partitions(); select tab_tier.migrate_all_tiers();

      SELECT COUNT() FROM sensor_log; — 6 millones SELECT COUNT() FROM only sensor_log; — 1060480 ???

    2. This wasn’t designed with future dating in mind, as it wasn’t a use-case I’d considered. It’s not trivial to add, so it’s probably low priority unless there’s a compelling reason for an archival partition system to look in the future. After all, how can the base table have the most recent data and archive older data, if all data is “recent”?

      1. Hi, I have two case with this type of information: a system that generates projections stock market, scheduling of medical appointments

        1. Noted. Just know that this extension was not designed for that. The point of this extension is to address hot zones, where a base table benefits from having a very small set of data that’s extremely active. Older data gets stashed in older partitions, and then eventually sent into long term storage.

          Having your data set in the future, it’s likely those projections or appointments will be touched again for various reasons, so you’d want it in the base table. It shouldn’t be moved into partitions or long term storage until it has aged sufficiently to keep it out of the hot (frequently updated or viewed) data zone.

          Reading from partitions is still relatively slow compared to addressing the base table. That’s the primary reason this extension leaves data in the base table instead of using triggers to redirect to partitions immediately. You want your frequently accessed data in the base table. That’s kinda the whole point. 🙂

          1. mmm is reasonable, in any case is very powerful extension and also have tables with call log that works perfectly . Thank you so much!

        2. You have got me thinking, though… The migration functions only look at recent data because they assume no older data gets into the base table. But if some old data does somehow get inserted, or the migration functions don’t get called for some reason, older data will never be migrated. So I clearly need to add a function that performs a full migrate to flush out any lingering data to address that problem.

          Provided I also modify the bootstrap function to create partitions for post-dated content as well, you could then use that function to migrate data for any existing partition. I still think that would be misusing the extension, but it would at least be possible.

Comments are closed.