PG Phriday: JOIN the Club

Friday April 08, 2016

I’ve been going on and on about esoteric Postgres features for so long, sometimes I forget my own neophyte beginnings. So let’s back way off and talk about how JOINs work. What are they? What do they do? Why would you use one? What can go wrong? It’s easier than you might imagine, but there are definitely some subtleties to account for.

To make this more relatable, let’s start with a club of pet owners that likes to keep track of everything:

CREATE TABLE pet_type
(
  type_id    SERIAL PRIMARY KEY,
  type_name  VARCHAR NOT NULL
);

CREATE TABLE pet_owner
(
  owner_id    SERIAL PRIMARY KEY,
  owner_name  VARCHAR NOT NULL
);

CREATE TABLE pet_record
(
  record_id  SERIAL PRIMARY KEY,
  type_id    INT NOT NULL REFERENCES pet_type (type_id),
  owner_id   INT NOT NULL REFERENCES pet_owner (owner_id),
  pet_name   VARCHAR NOT NULL
);

INSERT INTO pet_type (type_name)
VALUES ('Cat'), ('Dog'), ('Horse');

INSERT INTO pet_owner (owner_name)
VALUES ('Catherine L. Ady'), ('Peter Venkman'), ('Wilbur Post');

INSERT INTO pet_record (type_id, owner_id, pet_name)
VALUES (1, 1, 'Fluffy'), (1, 1, 'Mr. Whiskers'),
       (1, 2, 'Garfield'), (2, 2, 'Spinee'),
       (3, 3, 'Mr. Ed');

Now that we have some data, we want to retrieve it in a way that’s relevant and descriptive. I could go on a long and exhausting description regarding JOIN theory, but those already exist and probably do a better job anyway. Instead, I’ll just say there are two types of JOIN: INNER and OUTER.

INNER JOINs are the default in Postgres. Because it’s the default, we don’t actually need to use the INNER keyword. Legacy systems don’t always allow this, so some ORMs and end users still use INNER JOIN explicitly, but it isn’t necessary or even encouraged with Postgres. The INNER keyword simply means that rows must exist in both tables to appear in the results.

For example, let’s list all pets and their pet type:

SELECT p.pet_name, t.type_name
  FROM pet_record p
  JOIN pet_type t ON (t.type_id = p.type_id);

   pet_name   | type_name 
--------------+-----------
 Fluffy       | Cat
 Mr. Whiskers | Cat
 Garfield     | Cat
 Spinee       | Dog
 Mr. Ed       | Horse

This is the least complicated type of JOIN there is. In this particular case, the type_id column exists in both pet_type and pet_record. We can use that information to join the tables together so they resemble a single table. From that pool of columns, we can pick and choose the ones that interest us in the SELECT section.

Postgres also supports some shorthand syntax that simplifies the process:

SELECT p.pet_name, t.type_name
  FROM pet_record p
  JOIN pet_type t USING (type_id);

The primary caveat to USING is that the JOIN columns aren’t explicitly listed. If there are multiple tables with similarly named columns involved in a JOIN, Postgres might complain about ambiguity, and then we’d have to revert to the regular syntax. Still, it’s a nice shortcut for most queries.

If that’s an INNER JOIN, what then is an OUTER JOIN? This is a bit more difficult to visualize, but the logic behind it is this: if a row from one table does not exist in the other, represent the missing data as NULL. In other words, since we’re combining two tables to be a larger, virtual table, don’t remove records simply because they’re missing.

As an example, let’s add another pet type and then count how many pets of each type exist. First with a regular JOIN, and then with an OUTER JOIN:

INSERT INTO pet_type (type_name) VALUES ('Cow');

-- First with a standard JOIN

SELECT t.type_name, count(p.record_id) AS total
  FROM pet_type t
  JOIN pet_record p USING (type_id)
 GROUP BY t.type_name;

 type_name | total 
-----------+-------
 Cat       |     3
 Horse     |     1
 Dog       |     1

-- Now use an OUTER JOIN

SELECT t.type_name, count(p.record_id) AS total
  FROM pet_type t
  LEFT JOIN pet_record p USING (type_id)
 GROUP BY t.type_name;

 type_name | total 
-----------+-------
 Cow       |     0
 Cat       |     3
 Horse     |     1
 Dog       |     1

-- Here's the "table" the OUTER JOIN made

SELECT t.*, p.*
  FROM pet_type t
  LEFT JOIN pet_record p USING (type_id);

 type_id | type_name | record_id | type_id | owner_id |   pet_name   
---------+-----------+-----------+---------+----------+--------------
       1 | Cat       |         2 |       1 |        1 | Fluffy
       1 | Cat       |         3 |       1 |        1 | Mr. Whiskers
       1 | Cat       |         4 |       1 |        2 | Garfield
       2 | Dog       |         5 |       2 |        2 | Spinee
       3 | Horse     |         6 |       3 |        3 | Mr. Ed
       4 | Cow       |           |         |          |

The idea here is that nobody owns a cow, but we still want the pet type listed so we can see that nobody owns a cow. Otherwise, we might not even consider that owning a cow is even an option. But we didn’t even use the OUTER keyword! What’s with this LEFT nonsense? It just so happens that, like INNER, the OUTER keyword is also optional in Postgres. What actually makes a JOIN an OUTER JOIN are the LEFT, RIGHT, and FULL keywords.

Looking at the virtual table produced by the JOIN itself, it’s clear that we only have information supplied by the table that was on the left side of the join. Since this is an OUTER join type, Postgres substituted a bunch of blank data for the missing row in the pet_record table on the right, since its data was optional. If we wanted to represent the same query as a RIGHT join, we just need to move things around a bit:

SELECT t.*, p.*
  FROM pet_record p
  RIGHT JOIN pet_type t USING (type_id);

Now the records on the right are the source of the JOIN and those on the left are represented by NULL if they don’t exist. Otherwise, the output is exactly the same as before. Because of this, it’s fairly uncommon to encounter RIGHT JOIN in the wild. Western countries use writing systems that read left-to-right, so it’s likely joins will also follow that trend.

In that case, how is FULL different? Combine the idea of LEFT and RIGHT, and you get FULL. Essentially when joining two tables with FULL, rows that don’t exist on the right are represented by NULL, and rows that don’t exist on the left are NULL as well. This kind of output makes sense when combining reports or other aggregated output, since some summaries won’t have an equivalent on either side, but we still want the data.

In instances like our relational diagram, it’s not possible to observe the result of a FULL JOIN on the IDs because at least one ID always exists. But… we can fake it by joining on a column they don’t share, and never matches:

SELECT t.*, p.*
  FROM pet_record p
  FULL JOIN pet_type t ON (p.pet_name = t.type_name);

 type_id | type_name | record_id | type_id | owner_id |   pet_name   
---------+-----------+-----------+---------+----------+--------------
       1 | Cat       |           |         |          | 
       4 | Cow       |           |         |          | 
       2 | Dog       |           |         |          | 
         |           |         2 |       1 |        1 | Fluffy
         |           |         4 |       1 |        2 | Garfield
       3 | Horse     |           |         |          | 
         |           |         6 |       3 |        3 | Mr. Ed
         |           |         3 |       1 |        1 | Mr. Whiskers
         |           |         5 |       2 |        2 | Spinee

It’s like we just haphazardly glued the tables together into one giant table. But examining the output itself, the NULL values always occur when there’s no match for the column(s) we used for the join itself. Thus FULL is operating exactly as advertised.

OUTER JOIN is normally used for optional relationships, like in our example where there might be a type of pet nobody owns. It can also be used to find club members that currently own no pets. The point here is that a regular JOIN would exclude these situations and potentially produce erroneous data in a report. OUTER JOIN fills a very defined role, and even though it’s not as dominant, it’s indispensable.

As a final note on JOINs, there is one easily made mistake that generally results in something of a disaster. Imagine we wanted to know who owned each pet, and we rush the query with some cut-and-paste magic. This is what we find:

SELECT o.owner_name, p.pet_name, t.type_name
  FROM pet_record p
  JOIN pet_type t ON (t.type_id = p.type_id)
  JOIN pet_owner o ON (t.type_id = p.type_id);

    owner_name    |   pet_name   | type_name 
------------------+--------------+-----------
 Catherine L. Ady | Fluffy       | Cat
 Catherine L. Ady | Mr. Whiskers | Cat
 Catherine L. Ady | Garfield     | Cat
 Catherine L. Ady | Spinee       | Dog
 Catherine L. Ady | Mr. Ed       | Horse
 Peter Venkman    | Fluffy       | Cat
 Peter Venkman    | Mr. Whiskers | Cat
 Peter Venkman    | Garfield     | Cat
 Peter Venkman    | Spinee       | Dog
 Peter Venkman    | Mr. Ed       | Horse
 Wilbur Post      | Fluffy       | Cat
 Wilbur Post      | Mr. Whiskers | Cat
 Wilbur Post      | Garfield     | Cat
 Wilbur Post      | Spinee       | Dog
 Wilbur Post      | Mr. Ed       | Horse

Oh no! There’s no way every pet is owned by every person! What we have here is a Cartesian Product, and is the bane of databases everywhere. In our case, it happened because we joined the pet_owner table on type_id instead of owner_id. Since the join clause can be anything, Postgres dutifully joined every row in the owner table to every row in the record table. This operation is multiplicative, and so the complexity grows exponentially based on the size of the source tables.

We got off lucky because our tables are small example sets. Imagine what would have happened if we were joining two million-row tables! Producing, transmitting, and receiving one trillion rows instead of a million might just cause some problems. Even small results of a couple thousand rows can be amplified tremendously into millions of results due to a malformed JOIN. However, this example actually provides some valuable insight about what’s going on during the join itself.

Consider the overzealous output again. The rows we want are there, it’s just that the invalid ones weren’t pruned before they were returned to us. There are clearly some efficiency models in place to avoid exhaustive combinations, but this really is how joins work. They combine every row from one table with every row in another table, and weed out the results that don’t match the JOIN criteria. Because of that, the output can be anything from a boring and expected list of pet owners and their pets, or a nonsensical combination of everything. It’s all in the clauses.

Here’s what would have happened if we didn’t suffer from cut-and-paste-itis:

SELECT o.owner_name, p.pet_name, t.type_name
  FROM pet_record p
  JOIN pet_type t ON (t.type_id = p.type_id)
  JOIN pet_owner o ON (o.owner_id = p.owner_id);

    owner_name    |   pet_name   | type_name 
------------------+--------------+-----------
 Catherine L. Ady | Fluffy       | Cat
 Catherine L. Ady | Mr. Whiskers | Cat
 Peter Venkman    | Garfield     | Cat
 Peter Venkman    | Spinee       | Dog
 Wilbur Post      | Mr. Ed       | Horse

Much better. This is one of the main reasons databases like Postgres are so versatile. Not only will they give you exactly what you ask for, but the diversity of the results covers much of what is possible through Set Theory. The mathematics of Set Theory are justifiably impenetrable to someone without a degree in Mathematics—and even some that do—yet the power lurking there is undeniable. For everyone else, the basics are more than sufficient.

That kind of adaptability is a double-edged sword to be sure, but knowing the pitfalls is the best way to avoid them. Postgres is one of those things that’s easy to use and hard to master precisely because of the Math it represents. Now that you have some of the tools necessary to use Postgres, feel free to try and master it.

I still haven’t, but that doesn’t mean I won’t stop trying.

See Also