How Donald Trump Happened

November 9th, 2016 | Published in Rant | No Comments


I know a lot of people watched the election results in disbelief last night, or woke up this morning and thought something like this:

You maniacs! You blew it up!

There’s a bit of sad truth there. But the real problem is how we reached the point where this was even possible. The amount of incredulity on display is actually quite shocking to anyone who was paying attention. I knew Trump had some small chance given the political environment in America right now, yet I never thought he could actually win. The reason behind that misapprehension should shock some sense into every single one of us.

The Echo Chamber

One of Trump’s primary criticisms of the coverage of his campaign, and of himself, was media corruption. Can we really claim this was a mischaracterization? Back in March, the Washington Post ran 16 negative articles about Bernie Sanders in a single day. Considering the DNC itself was against his nomination, that’s hardly a surprise. Donna Brazile, the former CNN correspondent who replaced Debbie Wasserman Schultz as DNC Chair after Schultz resigned (to work for Hillary), colluded with CNN for early access to debate questions.

But it’s not just direct DNC involvement. A ridiculous 91 percent of Trump news coverage from July to October was negative. New York Times? Bought. ABC? Paid for. NBC? In the bag. It just goes on and on like this. Not even The Onion is safe, having been acquired by Univision, which just happens to be chaired by a top Clinton donor. Let’s not forget Google. Add in naked propaganda organizations like Correct The Record specifically designed to target social media, and you have a perfect storm of constant negative coverage.

It worked against Sanders, so I guess the theory was that Trump would suffer the same result. The problem is that all of this positive Clinton news and negative Trump bias caused a feedback loop. The media was then faced with the problem of selling its new narrative. Polling is acknowledged as an inexact science, so a little Democratic oversampling is easy to dismiss or ignore. Despite the real numbers, it would appear as though Hillary had the upper hand. This may or may not discourage Republican voters, but the inherent danger is that it works on the wrong side entirely. Secure in their knowledge that Trump is a joke with no chance of winning, Democrats might actually be more likely to skip the vote. Especially for a candidate they don’t enthusiastically support.

It became an endless litany on the news, in the papers, in social media, on forums, and everywhere else: Trump is a joke. Trump is sexist. Trump is evil incarnate and eats babies while pairing them with the wrong wine. The bottom line is that Trump has no chance, regardless of how much support he might actually have. By ignoring objectivity, media failed to identify legitimate voter concerns that might result in support for Trump.

We live in a time when media garners advertiser revenue through endless editorializing. They play one side against the other for views just as shamelessly and odiously as clickbait headlines do. That our political landscape is a Coke versus Pepsi analogy—with all of the associated emotional investment—should be no surprise to anyone. So when Trump claims the media is biased, everyone believes him. As a result, little to nothing the media says about Trump could ever be taken seriously.

Naive, Not Evil

So who would vote for Donald Trump, that racist, misogynist, insecure, shameless, huckster clown? Well, over 59 million people did just that, and though it’s mentally expedient to consider all of them misogynist racists, that’s a mere application of othering. Framing opposing viewpoints as unworthy or even repulsive is a favorite pastime of humanity. Maybe Trump supporters have concerns besides identity politics.

Michael Moore recently called Trump a “human Molotov cocktail”. Yet why might someone want to burn the system down in the first place? Well, we’ve lost 5 million manufacturing jobs since 2000, for one. Health care costs have tripled since 2001. Rents have quickly outpaced inflation while median income stagnates. Meanwhile, our infrastructure is crumbling around us.

People most affected by these problems are scared and angry. If not redress, they demand and deserve recognition. Instead, they received disdain and open derision. These are the people Hillary described as a “basket of deplorables”. That kind of tone-deaf rhetoric has consequences, and is no way to ingratiate yourself with a voting demographic.

Trump on the other hand, had a different message. He wants to make America “great” again. Whether or not such a thing is possible, that’s the mantra. Which candidate would you vote for, given only one will even acknowledge your existence? Will Trump deliver on any of his promises? Of course not; the system doesn’t work that way. But as a technique for garnering support, it’s a flawless and battle-tested approach.

There’s also the constant undercurrent of cognitive dissonance this election. Many Republicans repudiated Trump by voting for Hillary this year, but not everyone was so willing. For better or worse, Trump was the GOP nominee for 2016, and as such, received a lot of automatic base support. Tradition is something of a double-edged sword. An average Republican voter faced with Trump as a candidate must make a decision. Could the GOP, the party they’ve supported all of their lives, the party their friends and family have vouched for and rallied behind for generations, really be wrong? Or is the media just being its usual disingenuous self?

No, it is the voters who are wrong!

Given how much hyperbole has defined this election, it’s probably easier to believe Trump is misquoted, or has basic human flaws that are being exaggerated. There’s a reason Trump has been so ruthless in lambasting all forms of media and journalism; he knows nobody trusts them anymore and wants to reinforce that message. It’s just dirty politics trying to keep your guy out! How shameless can they get?!

A Flawed Candidate

I already wrote a long diatribe mostly centered on Hillary. The main defense I’ve heard is that Hillary only looks bad because the Republicans have been maligning her for 30 years. Well, so what? Even if that were the case and all of her scandals are conspiracy fever dreams, that’s the public perception. You know, the people who vote? Both Hillary and Trump rank among the least favorably viewed candidates in history. Nominating her was a critical mistake by the DNC. Despite all of the owed favors that led to the situation, this should never have happened in the first place. If it was so important to nominate a woman for 2016, why not someone with orders of magnitude less political baggage?

Consider the average undecided or even Republican voter. To them, Hillary is essentially riding Bill’s coattails and represents another family dynasty. As if George Bush I and II weren’t bad enough. She gives $200k speeches to the same soulless banks that are ruining the economy. She’s openly calling for a no-fly zone in Syria, which could cause another cold war with Russia. What is this, 1950? She deleted evidence after receiving a subpoena, and was never held accountable. Remember when Martha Stewart went to prison for merely lying to the FBI?

To the general public, she’s an unaccountable, corrupt, establishment candidate. Instead of the guy who regularly drew record breaking crowds, we got stuck with a widely hated demagogue because it was “her turn”. With a choice between Hillary and the embodiment of a narcissistic scam artist, why not vote to burn the whole thing down? It certainly sends a message.

Republicans hated her. Independents weren’t very enthusiastic for her. Even long-time Democrats begrudgingly accepted her, much as some Republicans probably accepted Trump. And still I believed she would win. She would be able to fill at least one Supreme Court seat, after all. Given her health problems, it’s possible Tim Kaine would take over, and he’s about as inoffensive as they come. That would have been a much safer result than setting fire to the whole establishment by voting Trump. As the results started rolling in, I went from surprised to dismayed, just like too many others.

That means even I was taken in by the media narrative. I went into the election knowing everything I just wrote, and yet I assumed Trump was enough of a caricature that it wouldn’t matter. This is the true and terrible aftermath of ignoring inconvenient realities. The media never sufficiently did their job, instead electing to butter our asses with sensationalism and melodrama. This is the result. We now have a reality TV star who mugs like a 1930s mob boss as our Commander in Chief.

Are you fucking kidding me?

Thanks to the bite-sized inoffensive pap the media has been feeding us for the last couple decades, we’re now Idiocracy. Thanks, assholes.



PG Phriday: MySQL Mingle

October 28th, 2016 | Published in Database, Tech Talk | No Comments


Through the wonderful magic of corporate agreements, I’ve been pulled back into (hopefully temporarily) managing a small army of MySQL servers. No! Why can’t this just be a terrible nightmare?! Does anyone deserve such debasement?

Side effects of using MySQL may include…

Hyperbole? Maybe a little. If MySQL were really that terrible, it wouldn’t be in such widespread use. However, as a Postgres DBA for so many years, I’ve come to appreciate what really sets Postgres apart from engines and development approaches like those showcased in MySQL.

Let’s explore a few of these.

Privileges

MySQL user management is an unholy abomination. As ridiculous as it sounds, this is no exaggeration.

This is how to create a superuser in Postgres:

CREATE USER some_bozo WITH PASSWORD 'cowsrule' SUPERUSER;

And here’s the same thing in MySQL:

CREATE USER 'some_bozo'@'%' IDENTIFIED BY 'cowsrule';
GRANT ALL ON *.* TO 'some_bozo'@'%';

That’s not so bad, right? Syntactically no, but the implication of the @ symbol can’t be ignored. Consider these two users:

CREATE USER 'just_a_girl'@'127.0.0.1' IDENTIFIED BY 'justwhy';
CREATE USER 'just_a_girl'@'192.168.56.10' IDENTIFIED BY 'you suck';

Yes, that’s two distinct users. It just so happens that the second IP address is for the local VM running MySQL itself. Depending on the host specified in the connection string, we would need to supply a different password to authenticate. In older MySQL installations, it’s not uncommon to accumulate dozens of entries for the same user. Occasionally one or more of these users will have a different password from the rest.
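
When inheriting an instance suspected of this kind of sprawl, the mysql.user grant table at least makes the duplication easy to audit. A quick check using the example account above:

SELECT user, host
  FROM mysql.user
 WHERE user = 'just_a_girl'
 ORDER BY host;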

What about roles? Well, MySQL doesn’t have them. Postgres DBAs generally recommend managing grants through role assignment. Users can be transient, so granting direct access to database objects isn’t very transferable. Imagine we have two tables out of one hundred, meant especially for reporting. We could do this in Postgres:

CREATE ROLE bi_role;
CREATE USER angie WITH PASSWORD 'sillygoose';
CREATE USER fred WITH PASSWORD 'dinotopia';
 
GRANT SELECT ON tab_a TO bi_role;
GRANT SELECT ON tab_b TO bi_role;
 
GRANT bi_role TO angie;
GRANT bi_role TO fred;

The bi_role role can be granted to anyone who takes on that job title, be it a single person, or a whole team. They all have access to the same objects, and we don’t have to micro-manage the grants to each individual user. Here’s the same thing in MySQL:

CREATE USER angie@'%' IDENTIFIED BY 'sillygoose';
CREATE USER fred@'%' IDENTIFIED BY 'dinotopia';
 
GRANT SELECT ON test.tab_a TO angie@'%';
GRANT SELECT ON test.tab_b TO angie@'%';
GRANT SELECT ON test.tab_a TO fred@'%';
GRANT SELECT ON test.tab_b TO fred@'%';

Not so bad with two tables, but this compounds geometrically as the number of tables or team members increases. As a consequence, it’s often easier to simply grant access to everything in a database, regardless of how appropriate that may be.
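
That shortcut usually looks something like this, using the same example users; one statement per user and the whole database is theirs to read:

GRANT SELECT ON test.* TO angie@'%';
GRANT SELECT ON test.* TO fred@'%';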

Beta Bumps

Quick! Can anyone figure out when the GTID feature was added to MySQL? It’s a huge improvement and makes things like multi-master possible, as well as enabling an entire toolchain that imparts functionality like automated failover. It’s 5.6, right?

Wrong. It’s 5.6.5.

Let that sink in. Really let it simmer. I was outright incredulous when I realized what happened here. A critical feature introduced in a dot release, something normally reserved for bug fixes and security vulnerabilities. There’s a good reason most software projects do it that way, and the question that introduced this section is the first clue.

According to the MySQL documentation, 5.6.5 is a Milestone release, and this is what it says about Milestones:

This is a milestone release, for use at your own risk. Significant development changes take place in milestone releases and you may encounter compatibility issues, such as data format changes that require attention in addition to the usual procedure of running mysql_upgrade. For example, you may find it necessary to dump your data with mysqldump before the upgrade and reload it afterward.

What they don’t say is that Milestones are part of their alpha, beta, and release candidate process. The first “real” version of 5.6 was actually 5.6.10. Surprise! Hopefully nobody has a legacy system running 5.6.8 somewhere, thinking it was a general release.
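
For anyone staring at a dusty server and wondering which side of that line it falls on, the server will say so directly; on anything older than 5.6, the second statement simply returns an empty set:

SELECT VERSION();
SHOW GLOBAL VARIABLES LIKE 'gtid_mode';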

What version is the first version of Postgres 9.6? 9.6.0. That’s important. There’s no ambiguity in that release designation.

Am I overreacting? Probably. I’m still in a bit of shock, and first reactions aren’t always sane.

Replication

This is one place MySQL may still maintain a slight advantage over Postgres. Postgres unequivocally has more accurate and stable replication with its binary transaction log stream. A replica really is a replica in every respect, right down to which page data is written to. Except… well, there’s a question of utility.

MySQL replication is more like database-wide logical replication. It’s as if pglogical were an integral part of Postgres. Don’t want to replicate a specific database in the instance? Exclude it. Don’t need certain tables? Ignore them. This is one reason MySQL is so popular for sharding. Using this kind of partial replication, shards can be distributed across a whole fabric of MySQL servers, with designated alternate locations.

Combine that with replication chaining, and you get a circular architecture of failover candidate shard masters. It’s multi-master without an explicit need for a global transaction ID. Of course, with GTIDs added in 5.6, these types of replica arrangements become even more robust.

Can Postgres do anything like that? Kinda. The citus extension is probably the closest approximation regarding distribution model. Tables can be selectively broadcast to various candidate servers, and shards can have a configurable amount of spare copies spread through the cluster. But all queries have to go through the central master, as it retains the full mapping of active shard locations. It’s a way to distribute data and parallelize queries, not to multi-master.
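
As a rough sketch of that model, and assuming a recent citus release where create_distributed_table() is available and the worker nodes are already registered with the coordinator, sharding our venerable sensor_log table looks something like this:

CREATE EXTENSION citus;

-- Keep two copies of every shard somewhere in the cluster.
SET citus.shard_replication_factor = 2;

-- Hash-distribute the table across the registered workers.
SELECT create_distributed_table('sensor_log', 'sensor_id');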

There’s the BDR extension, but it’s a replication strategy instead of a distribution technique. Every master must contain all data in the cluster. Short of using UNLOGGED tables, there’s no exclusion process. Yes it’s multi-master, but instead of spreading 5TB across 10 nodes, we end up with 10 5TB nodes. Yuck. BDR is also incompatible with foreign data wrappers, a hallmark feature of Postgres since 9.3. In fact, there’s a whole boatload of restrictions that drastically reduce Postgres’ capabilities.

This also ignores DDL. Not a single Postgres logical replication system will properly replicate a ‘CREATE TABLE’ statement, for example. Considering the effects of these statements are recorded in the transaction log, as they must be in order to appear on a binary replica, I’m not entirely sure why this is the case. Regardless, there’s no direct equivalent to MySQL’s replication system and its relative flexibility at this time. Even though MySQL’s replication is basically just a logical replay mechanism, being tightly interwoven into the core makes a massive difference in potential.

Sunny Side

Is there any way Postgres can save me from this torment? Is there anything I wish Postgres had? Oddly enough, the answers are no and yes respectively. This is something of a surprise, as I was fully expecting to hate every aspect of MySQL after giving it up back in the 3.23 days. It still has a lot of the old gotchas, but others have naturally evolved with the platform.

User management on MySQL systems is an utter nightmare I wouldn’t wish on my worst enemy. Because MySQL’s replication system is logical, many tools don’t work without the GTID feature. In some cases, this may mean across-the-board upgrades before reliable high availability is even an option. And on extremely old platforms, it’s generally easier to let existing database instances continue as MySQL systems than to migrate them. There’s a reason why WordPress is still officially MySQL only. So I’m stuck with these servers for the duration.

On the other hand, MySQL replication is something I wish Postgres had an equivalent for. Not because the Postgres binary approach is bad or incomplete, but because of the flexibility a fully logical approach bestows. It would be a chore to deploy and probably maintain, but it’s entirely possible to create a five node cluster and distribute ten shards across them at two shards each. Add in replication and one shard from each node could exist on two additional nodes. That’s an impressive level of redundancy. And each shard is fully read/write on its active tables, while all of the replicas remain read-capable.

It would require a responsible access layer to decode active shard locations depending on the current mapping, but those already exist. This is definitely an area where Postgres could grow. A Postgres extension could doubtlessly make it happen if there was enough demand and backing. Or maybe there’s already something in the works that hasn’t yet been released.

Either way, I plan on getting these new MySQL servers whipped into shape so they can be left to their own devices with little further intervention. I’m still a proud Postgres DBA, after all.



PG Phriday: Broken Parts

October 21st, 2016 | Published in Database, Tech Talk | 1 Comment


Partitioning tables in Postgres can be an extremely risky endeavor. Unfortunately, on many larger systems, it’s also essentially a requirement; the maximum size of a Postgres table is 32TB. This isn’t just because converting an existing table to a series of partitions is expensive or time consuming. We must consider how the query planner will react to the partitioned version of a table. There’s also the very real risk that we will implement (or already have implemented) flaws in the trigger or constraint logic.

Tools like pg_partman or pg_pathman are great for new systems, but existing deployments require retrofitting. So what happens if we have a large warehouse that’s already using a slightly flawed partition engine? And what other concerns remain even after achieving a pristine setup?

Let’s start with a relatively large sample table with three partitions. Each partition will consist of roughly 26M rows representing sensor readings every 100ms from 1000 sensors.

CREATE TABLE sensor_log (
  sensor_id     INT NOT NULL,
  location      VARCHAR NOT NULL,
  reading       INT NOT NULL,
  reading_date  TIMESTAMPTZ NOT NULL DEFAULT now()
);
 
CREATE TABLE sensor_log_201608 () INHERITS (sensor_log);
CREATE TABLE sensor_log_201609 () INHERITS (sensor_log);
CREATE TABLE sensor_log_201610 () INHERITS (sensor_log);
 
INSERT INTO sensor_log_201608 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-09-01'::TIMESTAMPTZ - (s.id * '100ms'::INTERVAL)
  FROM generate_series(1, 36000*24*31) s(id);
 
INSERT INTO sensor_log_201609 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-10-01'::TIMESTAMPTZ - (s.id * '100ms'::INTERVAL)
  FROM generate_series(1, 36000*24*30) s(id);
 
INSERT INTO sensor_log_201610 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-11-01'::TIMESTAMPTZ - (s.id * '100ms'::INTERVAL)
  FROM generate_series(1, 36000*24*31) s(id);
 
ALTER TABLE sensor_log_201608
  ADD CONSTRAINT ck_201608_part_reading_date
CHECK (reading_date >= '2016-08-01'::TIMESTAMPTZ AND
       reading_date < '2016-08-01'::TIMESTAMPTZ + INTERVAL '1 mon');
 
ALTER TABLE sensor_log_201609
  ADD CONSTRAINT ck_201609_part_reading_date
CHECK (reading_date >= '2016-09-01'::TIMESTAMPTZ AND
       reading_date < '2016-09-01'::TIMESTAMPTZ + INTERVAL '1 mon');
 
ALTER TABLE sensor_log_201610
  ADD CONSTRAINT ck_201610_part_reading_date
CHECK (reading_date >= '2016-10-01'::TIMESTAMPTZ AND
       reading_date < '2016-10-01'::TIMESTAMPTZ + INTERVAL '1 mon');
 
CREATE INDEX idx_sensor_log_201608_sensor_id
    ON sensor_log_201608 (sensor_id);
CREATE INDEX idx_sensor_log_201608_reading_date
    ON sensor_log_201608 (reading_date DESC);
 
CREATE INDEX idx_sensor_log_201609_sensor_id
    ON sensor_log_201609 (sensor_id);
CREATE INDEX idx_sensor_log_201609_reading_date
    ON sensor_log_201609 (reading_date DESC);
 
CREATE INDEX idx_sensor_log_201610_sensor_id
    ON sensor_log_201610 (sensor_id);
CREATE INDEX idx_sensor_log_201610_reading_date
    ON sensor_log_201610 (reading_date DESC);
 
SET default_statistics_target = 400;
 
ANALYZE sensor_log_201608;
ANALYZE sensor_log_201609;
ANALYZE sensor_log_201610;

We’re early in the analysis and there’s already a major problem. The Postgres planner only folds immutable expressions into constants, and adding an interval to a timestamptz is merely stable because the result depends on the session time zone. That leaves the upper boundary as an unresolved calculation the planner can’t compare against query predicates, which effectively erases the second boundary from all of our constraints. Simply because we used “+ INTERVAL ‘1 mon'” instead of a static value, every query will always target every partition.

Here’s how that looks:

EXPLAIN 
SELECT s1.*
  FROM sensor_log s1
 WHERE reading_date >= '2016-10-01'
   AND reading_date < '2016-10-21'
   AND location = '42';
 
 Append  (cost=0.00..639335.50 rows=164796 width=18)
   ->  Seq Scan on sensor_log s1  (cost=0.00..0.00 rows=1 width=48)
         Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '42'::text))
   ->  Index Scan using idx_sensor_log_201608_reading_date on sensor_log_201608 s1_1  (cost=0.44..8.46 rows=1 width=18)
         Index Cond: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone))
         Filter: ((location)::text = '42'::text)
   ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609 s1_2  (cost=0.44..8.46 rows=1 width=18)
         Index Cond: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone))
         Filter: ((location)::text = '42'::text)
   ->  Seq Scan on sensor_log_201610 s1_3  (cost=0.00..639318.58 rows=164793 width=18)
         Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '42'::text))
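
For the record, this isn’t a matter of constraint exclusion being switched off. The default setting already covers inheritance children, though it never hurts to verify before blaming the constraints themselves; a quick sanity check:

SHOW constraint_exclusion;  -- 'partition' (the default) is all we need here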

This particular snafu invokes several related and extremely subtle problems. Consider a simple report where we want to compare sensor variance based on location. Operating under the assumption certain sensor locations don’t always trigger during certain sample periods, that’s just a single left join. If this were an inventory system, we might be comparing customers who order rubber chickens versus those who lean toward high performance graphics cards.

With our data, we can do this with a single self join, but why fetch everything? We have millions of entries going back for three months! Let’s apply a LIMIT clause so we can decide how to further analyze the results:

EXPLAIN
SELECT s1.*
  FROM sensor_log s1
  LEFT JOIN sensor_log s2 ON (
         s2.sensor_id = s1.sensor_id AND
         s2.reading_date >= '2016-10-01' AND
         s2.reading_date < '2016-10-21' AND
         s2.location = '87'
       )
 WHERE s1.reading_date >= '2016-10-01'
   AND s1.reading_date < '2016-10-21'
   AND s1.location = '42'
 LIMIT 10;
 
 LIMIT  (cost=2.73..26.59 ROWS=10 width=18)
   ->  MERGE LEFT JOIN  (cost=2.73..336756014.96 ROWS=141129646 width=18)
         MERGE Cond: (s1.sensor_id = s2.sensor_id)
         ->  MERGE Append  (cost=1.36..167319256.22 ROWS=164796 width=18)
               Sort KEY: s1.sensor_id
               ->  Sort  (cost=0.01..0.02 ROWS=1 width=48)
                     Sort KEY: s1.sensor_id
                     ->  Seq Scan ON sensor_log s1  (cost=0.00..0.00 ROWS=1 width=48)
                           FILTER: ((reading_date >= '2016-10-01 00:00:00-05'::TIMESTAMP WITH TIME zone) AND (reading_date < '2016-10-21 00:00:00-05'::TIMESTAMP WITH TIME zone) AND ((location)::text = '42'::text))
               ->  INDEX Scan USING idx_sensor_log_201608_sensor_id ON sensor_log_201608 s1_1  (cost=0.44..56414786.91 ROWS=1 width=18)
                     FILTER: ((reading_date >= '2016-10-01 00:00:00-05'::TIMESTAMP WITH TIME zone) AND (reading_date < '2016-10-21 00:00:00-05'::TIMESTAMP WITH TIME zone) AND ((location)::text = '42'::text))
               ->  INDEX Scan USING idx_sensor_log_201609_sensor_id ON sensor_log_201609 s1_2  (cost=0.44..54487523.33 ROWS=1 width=18)
                     FILTER: ((reading_date >= '2016-10-01 00:00:00-05'::TIMESTAMP WITH TIME zone) AND (reading_date < '2016-10-21 00:00:00-05'::TIMESTAMP WITH TIME zone) AND ((location)::text = '42'::text))
               ->  INDEX Scan USING idx_sensor_log_201610_sensor_id ON sensor_log_201610 s1_3  (cost=0.44..56413238.01 ROWS=164793 width=18)
                     FILTER: ((reading_date >= '2016-10-01 00:00:00-05'::TIMESTAMP WITH TIME zone) AND (reading_date < '2016-10-21 00:00:00-05'::TIMESTAMP WITH TIME zone) AND ((location)::text = '42'::text))
         ->  Materialize  (cost=1.36..167319830.26 ROWS=171278 width=4)
               ->  MERGE Append  (cost=1.36..167319402.06 ROWS=171278 width=4)
                     Sort KEY: s2.sensor_id
                     ->  Sort  (cost=0.01..0.02 ROWS=1 width=4)
                           Sort KEY: s2.sensor_id
                           ->  Seq Scan ON sensor_log s2  (cost=0.00..0.00 ROWS=1 width=4)
                                 FILTER: ((reading_date >= '2016-10-01 00:00:00-05'::TIMESTAMP WITH TIME zone) AND (reading_date < '2016-10-21 00:00:00-05'::TIMESTAMP WITH TIME zone) AND ((location)::text = '87'::text))
                     ->  INDEX Scan USING idx_sensor_log_201608_sensor_id ON sensor_log_201608 s2_1  (cost=0.44..56414786.91 ROWS=1 width=4)
                           FILTER: ((reading_date >= '2016-10-01 00:00:00-05'::TIMESTAMP WITH TIME zone) AND (reading_date < '2016-10-21 00:00:00-05'::TIMESTAMP WITH TIME zone) AND ((location)::text = '87'::text))
                     ->  INDEX Scan USING idx_sensor_log_201609_sensor_id ON sensor_log_201609 s2_2  (cost=0.44..54487523.33 ROWS=1 width=4)
                           FILTER: ((reading_date >= '2016-10-01 00:00:00-05'::TIMESTAMP WITH TIME zone) AND (reading_date < '2016-10-21 00:00:00-05'::TIMESTAMP WITH TIME zone) AND ((location)::text = '87'::text))
                     ->  INDEX Scan USING idx_sensor_log_201610_sensor_id ON sensor_log_201610 s2_3  (cost=0.44..56413238.01 ROWS=171275 width=4)
                           FILTER: ((reading_date >= '2016-10-01 00:00:00-05'::TIMESTAMP WITH TIME zone) AND (reading_date < '2016-10-21 00:00:00-05'::TIMESTAMP WITH TIME zone) AND ((location)::text = '87'::text))

There’s a lot going on here, so we’ll just focus on what went wrong. In this case, the culprit is the LIMIT clause itself. By trying to reduce output, we told Postgres to optimize the execution and avoid generating the entire result. This means it will keep reading rows until it accumulates 10 that match all of the predicates we supplied. Simple and efficient, right?

Well… no. If we look closer at one of the indexes it’s using, we’ll see it opted for the index on sensor_id instead of the index on reading_date. This isn’t necessarily a problem on the October partition, since it should match ten rows relatively quickly. If it assumes even distribution of 100 sensor locations, it may have to read up to 1000 rows to get the 10 we wanted. That’s not awful, is it?

But consider the other two partitions. Remember those? There will be no match for August or September. But Postgres doesn’t know that yet, meaning it will read the entire sensor_id index for both of them. That index is over 500MB on each partition, so we just read an extra 1GB for no reason at all. But wait! There’s more! We still need to apply the filters to remove them from consideration, right? That means a random read to the table heap for every tuple in the index, plus or minus some batching for the sake of efficiency. What’s a few hundred thousand extra random reads between friends?

Now imagine each table contains ten times more data, and there are three years worth of partitions instead of three months. Our particular example requires about six minutes to execute, but real-world cases of this can quickly cascade into hours or even days.

How can we fix this? It’s clear we need to rebuild the constraints so they properly convey the fact that certain partitions should be ignored. But we already have billions of rows, and new constraints are normally validated against every existing row when they’re added. Or are they? There is a way to automate this:

DO $$
DECLARE
  r RECORD;
BEGIN
  FOR r IN SELECT tab,
           to_date(ym, 'YYYYMM01') AS sdate,
           to_date(ym, 'YYYYMM01') + '1 mon'::INTERVAL AS edate,
           cname, col
      FROM (SELECT i.inhrelid::REGCLASS::TEXT AS tab,
                   SUBSTRING(i.inhrelid::REGCLASS::TEXT FROM '......$') AS ym,
                   co.conname AS cname,
                   SUBSTRING(co.consrc FROM '\w+') AS col
              FROM pg_inherits i
              JOIN pg_constraint co ON (co.conrelid = i.inhrelid)
             WHERE co.contype = 'c'
               AND co.consrc ILIKE '%1 mon%') cfix
  LOOP
    EXECUTE 'ALTER TABLE ' || r.tab || ' DROP CONSTRAINT ' ||
        quote_ident(r.cname);
    EXECUTE 'ALTER TABLE ' || r.tab || ' ADD CONSTRAINT ' ||
        quote_ident(r.cname) || ' CHECK (' ||
        quote_ident(r.col) || ' >= ' || quote_literal(r.sdate) || ' AND ' ||
        quote_ident(r.col) || ' < ' || quote_literal(r.edate) || ') NOT VALID';
  END LOOP;
END;
$$ LANGUAGE plpgsql;
 
UPDATE pg_constraint 
   SET convalidated = TRUE
 WHERE NOT convalidated;

The giant gross query that defines the loop simply identifies all of the broken partition constraints and provides replacement boundaries. Within the loop, we drop all of the bad constraints and replace them with better ones, being careful to tell Postgres they’re not valid so it doesn’t check them. At the end, we modify the Postgres catalog and directly mark the constraints as valid, circumventing the whole verification process entirely.

And it doesn’t matter if we have three bad constraints as in this example, or three thousand. We’ve just fixed all of them. After this, our six minute query now executes in a mere 100ms, making it roughly 4000 times faster. Remember, our example is conservative; a real warehouse would see much more striking results.

But we need to be careful. Problems like this also apply to using CURRENT_DATE or now() + INTERVAL '1 mon' or any other kind of substitution in our WHERE clauses or join conditions. With no static value, Postgres makes assumptions about the potential row counts and estimates accordingly. Sometimes those assumptions don’t change much; in other cases, they can mean the difference between a sequential scan and an index scan (and the associated random reads).

Index scans are usually great! Unless of course, the table has a few hundred million rows and the predicate matches 60% of them. With the variable substitution in place, Postgres will probably prefer the index scan, and our query will be much slower as a consequence. Whenever or wherever possible, use static values in predicates that touch indexes or are the primary filter condition for a query. By doing that, all of the extra conditions will still apply, but won’t adversely affect query performance.
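
As a trivial illustration of that advice, compare letting the database compute the boundary with supplying it directly. The literal below is just a stand-in for whatever value the application would calculate:

-- The boundary is an expression, so the planner can't fold it at plan
-- time; estimates suffer and no partitions are excluded:
SELECT COUNT(*) FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '1 week';

-- The boundary is a static value, so estimates and partition exclusion
-- behave as expected:
SELECT COUNT(*) FROM sensor_log
 WHERE reading_date >= '2016-10-14';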

As a side note, this is one of the other reasons indexes aren’t merely a “go faster” button. Putting them in the wrong place or invoking them improperly can make performance worse. In most cases this translates to a standard deviation or two. In others, it results in geometric or exponential increases in query times. It’s those outliers that can really ruin our day.

When partitions are concerned, it pays to be extra vigilant.



PG Phriday: Perfectly Logical

October 14th, 2016 | Published in Database, Tech Talk | 6 Comments


One of the things Postgres has been “missing” for a while is logical replication based on activity replay. Until fairly recently, in order to replicate single tables from one database to another, we had to encumber the table with performance-robbing triggers coupled to a third party daemon to manage transport. Those days might finally be behind us thanks to pglogical.

But is it easy to use? Let’s give it a try on our trusty sensor_log table. First, we’ll need to create everything on the donor node:

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY NOT NULL,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE - (s.id || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);
 
CREATE EXTENSION pglogical;
 
SELECT pglogical.create_node(
    node_name := 'prod_sensors',
    dsn := 'host=localhost port=5432 dbname=postgres'
);
 
SELECT pglogical.create_replication_set(
    set_name := 'logging',
    replicate_insert := TRUE, replicate_update := FALSE,
    replicate_delete := FALSE, replicate_truncate := FALSE
);
 
SELECT pglogical.replication_set_add_table(
    set_name := 'logging', relation := 'sensor_log', 
    synchronize_data := TRUE
);

After all of that, we have a replication set containing a single table representing one million rows of sensor data. At this point, any number of subscribers could connect to the replication set and request its contents. After that, all inserts would also be replayed on the subscriber once they are detected in the Postgres transaction log.

In a second Postgres instance, we would do this to “consume” the table:

CREATE TABLE sensor_log (
  id            INT PRIMARY KEY NOT NULL,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
CREATE EXTENSION pglogical;
 
SELECT pglogical.create_node(
    node_name := 'sensor_warehouse',
    dsn := 'host=localhost port=5999 dbname=postgres'
);
 
SELECT pglogical.create_subscription(
    subscription_name := 'wh_sensor_data',
    replication_sets := array['logging'],
    provider_dsn := 'host=localhost port=5432 dbname=postgres'
);
 
SELECT pg_sleep(5);
 
SELECT COUNT(*) FROM sensor_log;
 
  count  
---------
 1000000

We actually did a few things here, so let’s break it down. We began by creating the table structure itself, which will act as a container for the incoming data. Then we created the node just as we did on the origin, and finally subscribed to the provider itself.

We should note that it wasn’t strictly required to create the table beforehand. The create_subscription function has a parameter called synchronize_structure that uses pg_dump to obtain the table DDL during the subscription phase for replay on the subscriber. Unfortunately, this operation does not restrict itself to the tables in the replication set for some reason. As a result, any conflicting objects in the existing schema will cause the table import to fail.
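
For the curious, the automated variant is just one more named parameter; a sketch only, since in our case it would have collided with the sensor_log definition we already created:

SELECT pglogical.create_subscription(
    subscription_name := 'wh_sensor_data',
    replication_sets := array['logging'],
    synchronize_structure := TRUE,
    provider_dsn := 'host=localhost port=5432 dbname=postgres'
);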

Once we’ve established the subscription, we merely need to wait a few seconds for the initial data copy to complete. This is only one million rows, so we don’t have to wait very long. After that, all subsequent inserts should also show up on this subscriber.
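
A quick smoke test makes that easy to confirm. On the provider (the values are arbitrary):

INSERT INTO sensor_log (location, reading, reading_date)
VALUES ('1', 42, now());

-- Moments later, on the subscriber:
SELECT COUNT(*) FROM sensor_log;  -- should now report 1000001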

One really cool thing about pglogical is that it takes advantage of Postgres 9.4+ background workers. This means there’s no external daemon sitting around watching a queue, or the transaction logs, or any other source. There’s an actual Postgres backend supervising the Postgres replication stream, and it will capture applicable content for the sensor_log table. The extension becomes the management daemon as if it were a native Postgres feature.
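
Since the apply worker connects through an ordinary replication slot and walsender, the usual provider-side views show it just like any streaming replica; a couple of quick checks:

-- On the provider:
SELECT slot_name, plugin, active
  FROM pg_replication_slots;

SELECT application_name, state
  FROM pg_stat_replication;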

And it gets even better. For those of us that rely on large warehouses that might accumulate data from one or more active production locations for several years, this is something of a golden fleece. Trigger-based replication falls flat here because synchronization means synchronization. If we need a table to just sit and accumulate data, it was ETL, ad-hoc copy scripts, or nothing. But what if we set up our sensor_log table on the subscriber just a bit differently?

SELECT pglogical.drop_subscription(
    subscription_name := 'wh_sensor_data'
);
 
TRUNCATE TABLE sensor_log;
 
CREATE TABLE sensor_log_part (
    LIKE sensor_log INCLUDING INDEXES
) INHERITS (sensor_log);
 
CREATE OR REPLACE FUNCTION f_redirect_sensor_log()
RETURNS TRIGGER 
AS $$
BEGIN
  INSERT INTO sensor_log_part VALUES (NEW.*);
  PERFORM 1;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER t_redirect_sensor_log_bi
BEFORE INSERT ON sensor_log
   FOR EACH ROW
       EXECUTE PROCEDURE f_redirect_sensor_log();
 
ALTER TABLE sensor_log
      ENABLE ALWAYS TRIGGER t_redirect_sensor_log_bi;
 
SELECT pglogical.create_subscription(
    subscription_name := 'wh_sensor_data',
    replication_sets := array['logging'],
    provider_dsn := 'host=localhost port=5432 dbname=postgres'
);
 
SELECT pg_sleep(5);
 
SELECT COUNT(*) FROM ONLY sensor_log;
 
 count 
-------
     0
 
SELECT COUNT(*) FROM sensor_log;
 
  count  
---------
 1000000

This is where the magic really resides. Instead of inserting data into the table itself, we were able to redirect it into a table partition. This is important, because some of the larger Postgres warehouses rely on partitions to distribute maintenance and risk. Instead of one single 4TB table, we might have a dozen 350GB tables.

This proof of concept suggests pglogical is compatible with existing partition systems. And that is important if we’re trying to simplify our stack by removing complex ETL processes. Since we created the replication set to only capture insert activity, we can purge the origin of old data as frequently as we wish, and it will remain in the warehouse indefinitely.
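
Since the ‘logging’ replication set only captures inserts, the production-side retention job can be as blunt as this without touching the warehouse copy (the interval here is arbitrary):

DELETE FROM sensor_log
 WHERE reading_date < now() - INTERVAL '3 months';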

This comes with one caveat, though. Notice that last ALTER TABLE statement where we marked our trigger to always fire? This is necessary because pglogical uses copy mechanisms that otherwise circumvent trigger logic. If we didn’t include that statement, the sync would have dumped all of the rows into the base sensor_log table instead of our target partition.

This last detail raises an important question: are common partition management extensions like pg_partman really compatible with this use case? Automated partition management libraries must regularly modify the underlying trigger to ensure current data is redirected into the appropriate partition. In this case, the answer depends heavily on how the triggers are redefined.

If partition management systems simply replace the underlying function, all is safe: the trigger definition itself is never dropped and recreated, so the ALWAYS modification survives. That would also be the preferred method, since it doesn’t require a lock on the table to recreate the trigger definition. Unfortunately this is only an assumption, and even the “correct” approach doesn’t remove the extra step of manually marking the triggers as ALWAYS after creation.

For pg_partman, we’d do something like this after initial partition setup on the warehouse node:

DO $$
DECLARE
    table_name VARCHAR;
    trigger_name VARCHAR;
BEGIN
    FOR table_name, trigger_name IN
        SELECT tgrelid::regclass::text, tgname
          FROM pg_trigger
         WHERE tgname LIKE '%\_part\_trig'
    LOOP
        EXECUTE 'ALTER TABLE ' || table_name ||
                ' ENABLE ALWAYS TRIGGER ' || trigger_name;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Gross. This also precludes any new tables from being added to the subscription set without this kind of interception. It would be better if popular partition libraries offered a parameter for always enforcing triggers. Or perhaps a more universal approach like some kind of post-definition hook for executing an arbitrary function that could handle advanced use cases. I promise I’m not picking on pg_partman! This applies even to not-so-popular partition extensions like my own tab_tier.

It’s a crazy world out there in extension land. Apparently we have to keep our eyes peeled for exciting new functionality like that which pglogical introduces, and for any ensuing implications for potentially related extensions. Either way, our warehouses are happier for it!



PG Phriday: Pesky Partition Plans

October 7th, 2016 | Published in Database, Tech Talk | No Comments


For all of those warehouse queries that never seem to complete before the heat death of the universe, there’s often a faster version. Sometimes this is due to a fundamental misunderstanding of how queries work, or how Postgres specifically functions. The trick is knowing when to back away slowly from an ugly but efficient query, and when to inject a flurry of predicates to fully illustrate the original intent of the query so the planner makes better decisions. When partitions are involved, this can be quite an adventure.

Sometimes optimizing partitioned plans is an art. Other times, things are a little easier to isolate. Imagine we have a sensor log partitioned by month, and our app is fairly new, so there are only three months available. Anything illustrated in this case would only get worse as time progresses.

Just to make things simple, there will be no trigger logic or anything to actually move rows around. We’re just looking at query performance, not the feasibility of the structure itself.

CREATE TABLE sensor_log (
  sensor_id     INT NOT NULL,
  location      VARCHAR NOT NULL,
  reading       INT NOT NULL,
  reading_date  TIMESTAMPTZ NOT NULL DEFAULT now()
);
 
CREATE TABLE sensor_log_201607 () INHERITS (sensor_log);
CREATE TABLE sensor_log_201608 () INHERITS (sensor_log);
CREATE TABLE sensor_log_201609 () INHERITS (sensor_log);
 
INSERT INTO sensor_log_201607 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-08-01'::TIMESTAMPTZ - (s.id || 's')::INTERVAL
  FROM generate_series(1, 3600*24*30) s(id);
 
INSERT INTO sensor_log_201608 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-09-01'::TIMESTAMPTZ - (s.id || 's')::INTERVAL
  FROM generate_series(1, 3600*24*30) s(id);
 
INSERT INTO sensor_log_201609 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-10-01'::TIMESTAMPTZ - (s.id || 's')::INTERVAL
  FROM generate_series(1, 3600*24*30) s(id);
 
ALTER TABLE sensor_log_201607
  ADD CONSTRAINT ck_201607_part_reading_date
CHECK (reading_date >= '2016-07-01' AND reading_date < '2016-08-01');
 
ALTER TABLE sensor_log_201608
  ADD CONSTRAINT ck_201608_part_reading_date
CHECK (reading_date >= '2016-08-01' AND reading_date < '2016-09-01');
 
ALTER TABLE sensor_log_201609
  ADD CONSTRAINT ck_201609_part_reading_date
CHECK (reading_date >= '2016-09-01' AND reading_date < '2016-10-01');
 
CREATE INDEX idx_sensor_log_201607_sensor_id
    ON sensor_log_201607 (sensor_id);
CREATE INDEX idx_sensor_log_201607_reading_date
    ON sensor_log_201607 (reading_date DESC);
 
CREATE INDEX idx_sensor_log_201608_sensor_id
    ON sensor_log_201608 (sensor_id);
CREATE INDEX idx_sensor_log_201608_reading_date
    ON sensor_log_201608 (reading_date DESC);
 
CREATE INDEX idx_sensor_log_201609_sensor_id
    ON sensor_log_201609 (sensor_id);
CREATE INDEX idx_sensor_log_201609_reading_date
    ON sensor_log_201609 (reading_date DESC);
 
ANALYZE sensor_log_201607;
ANALYZE sensor_log_201608;
ANALYZE sensor_log_201609;

With these three (not including the root) tables in place, a query can target the base table and operate as if the partitioning didn’t exist. There isn’t anything new in that regard, but it’s the first place where a business intelligence operative might stumble. The base table can obscure potentially hundreds of other segments and the associated overhead.
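
For anyone inheriting such a schema, it’s worth knowing how much is actually hiding behind the root before querying it. The catalog makes that easy to check:

SELECT inhrelid::REGCLASS AS partition_name
  FROM pg_inherits
 WHERE inhparent = 'sensor_log'::REGCLASS;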

This is especially concerning in cumulative warehouse archives where there are no primary keys. In these contexts, data is unique per key and date combination. The date isn’t simply a means of isolating specific ranges for comparative aggregates.

An operator might understand that smaller query result sets make for fewer index matches and faster execution times. So maybe they try to isolate in a step-by-step manner with temporary tables or CTEs. Using that as a foundation, they produce this query:

EXPLAIN
WITH source AS (
    SELECT *
      FROM sensor_log
     WHERE reading_date >= '2016-09-11'
       AND reading_date < '2016-09-18'
       AND location = '5'
)
SELECT source.*, target.*
  FROM source
  LEFT JOIN sensor_log target ON (
         source.sensor_id = target.sensor_id AND
         target.location = '10'
       );
 
 Merge Left Join  (cost=178118.79..212683.46 rows=2302397 width=66)
   Merge Cond: (source.sensor_id = target.sensor_id)
   CTE source
     ->  Append  (cost=0.00..24384.21 rows=5743 width=18)
           ->  Seq Scan on sensor_log  (cost=0.00..0.00 rows=1 width=48)
                 Filter: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone) AND ((location)::text = '5'::text))
           ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609  (cost=0.43..24384.21 rows=5742 width=18)
                 Index Cond: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone))
                 Filter: ((location)::text = '5'::text)
   ->  Sort  (cost=473.44..487.80 rows=5743 width=48)
         Sort Key: source.sensor_id
         ->  CTE Scan on source  (cost=0.00..114.86 rows=5743 width=48)
   ->  Sort  (cost=153261.13..153461.58 rows=80181 width=18)
         Sort Key: target.sensor_id
         ->  Append  (cost=0.00..146730.00 rows=80181 width=18)
               ->  Seq Scan on sensor_log target  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201607 target_1  (cost=0.00..48910.00 rows=26352 width=18)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201608 target_2  (cost=0.00..48910.00 rows=27562 width=18)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201609 target_3  (cost=0.00..48910.00 rows=26266 width=18)
                     Filter: ((location)::text = '10'::text)

In this example, the CTE simulates just some other random table joining into our partitioned sample. Notice those last three elements where the planner decides to perform a sequential scan on all of the partitions? It’s not entirely naive to think date constraints are transitive when working with tables that are commonly unique. Nor is it strictly common knowledge that CTEs act as optimization fences that completely isolate the internal query. But in large archives that trend toward accumulation, those assumptions can be extremely dangerous.

Let’s say we know about the optimization fence and rewrite the query, retaining our naive assumption that date constraints are shared by the join:

EXPLAIN
SELECT source.*, target.*
  FROM sensor_log source
  LEFT JOIN sensor_log target ON (
         source.sensor_id = target.sensor_id AND
         target.location = '10'
       )
 WHERE source.reading_date >= '2016-09-11'
   AND source.reading_date < '2016-09-18'
   AND source.location = '5';
 
 Merge Left Join  (cost=178003.93..212568.60 rows=2302397 width=36)
   Merge Cond: (source.sensor_id = target.sensor_id)
   ->  Sort  (cost=24742.79..24757.15 rows=5743 width=18)
         Sort Key: source.sensor_id
         ->  Append  (cost=0.00..24384.21 rows=5743 width=18)
               ->  Seq Scan on sensor_log source  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone) AND ((location)::text = '5'::text))
               ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609 source_1  (cost=0.43..24384.21 rows=5742 width=18)
                     Index Cond: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone))
                     Filter: ((location)::text = '5'::text)
   ->  Sort  (cost=153261.13..153461.58 rows=80181 width=18)
         Sort Key: target.sensor_id
         ->  Append  (cost=0.00..146730.00 rows=80181 width=18)
               ->  Seq Scan on sensor_log target  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201607 target_1  (cost=0.00..48910.00 rows=26352 width=18)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201608 target_2  (cost=0.00..48910.00 rows=27562 width=18)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201609 target_3  (cost=0.00..48910.00 rows=26266 width=18)
                     Filter: ((location)::text = '10'::text)

As expected, the planner properly applied the date range to the source table, but not the target, and we end up with sequential scans on all of the partitions. Obvious, right? Maybe not. It’s actually not uncommon to perform cross-data comparisons with current readings using multiple fragments of composite keys. When operating like this, it’s easy to forget the date constraints are equally critical, especially with partitioned tables that can span years and represent several TB of data.

Literally every human being that has touched a warehouse in my presence has made this mistake—including myself. Whether it’s a date, ID, or some other value, conditionals shouldn’t be assumed. Nor are they transitive, even if the database engine supports that kind of predicate migration.

Here’s what the query and its plan should resemble:

EXPLAIN
SELECT source.*, target.*
  FROM sensor_log source
  LEFT JOIN sensor_log target ON (
         source.sensor_id = target.sensor_id AND
         target.location = '10' AND
         target.reading_date >= '2016-09-11' AND
         target.reading_date < '2016-09-18'
       )
 WHERE source.reading_date >= '2016-09-11'
   AND source.reading_date < '2016-09-18'
   AND source.location = '5';
 
 Merge Left Join  (cost=49517.62..52217.26 rows=178062 width=36)
   Merge Cond: (source.sensor_id = target.sensor_id)
   ->  Sort  (cost=24742.79..24757.15 rows=5743 width=18)
         Sort Key: source.sensor_id
         ->  Append  (cost=0.00..24384.21 rows=5743 width=18)
               ->  Seq Scan on sensor_log source  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone) AND ((location)::text = '5'::text))
               ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609 source_1  (cost=0.43..24384.21 rows=5742 width=18)
                     Index Cond: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone))
                     Filter: ((location)::text = '5'::text)
   ->  Sort  (cost=24774.82..24790.32 rows=6201 width=18)
         Sort Key: target.sensor_id
         ->  Append  (cost=0.00..24384.21 rows=6201 width=18)
               ->  Seq Scan on sensor_log target  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone) AND ((location)::text = '10'::text))
               ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609 target_1  (cost=0.43..24384.21 rows=6200 width=18)
                     Index Cond: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone))
                     Filter: ((location)::text = '10'::text)

And there we have it: the partitioned table results are restricted specifically to equivalent date ranges. Those date clauses we added to the join must exist for any similarly partitioned table, unless the intent really is to compare against all historical records in all partitions. That’s not exactly recommended either; that kind of analysis really belongs in fact tables and data cubes, which are far better suited to aggregate use than raw results are.

Just be cognizant of partitioned tables. Quite a few people claim there are better ways to implement them in Postgres provided some underlying code changes are committed. Until that time, it’s critical to always include the partition column in every query, join, sub-clause, partridge, and pear tree. It’s how Postgres decides which partitions to include in results, and often enough, which indexes to consider.

The alternative is having to wait for hours or days while Postgres checks every existing partition of every included table for the search criteria, and nobody wants that.

Well… except attackers who are trying to DOS your database.


