PG Phriday: JOIN the Club
I’ve been going on and on about esoteric Postgres features for so long, sometimes I forget my own neophyte beginnings. So let’s back way off and talk about how JOINs work. What are they? What do they do? Why would you use one? What can go wrong? It’s easier than you might imagine, but there are definitely some subtleties to account for.
To make this more relatable, let’s start with a club of pet owners that likes to keep track of everything:
CREATE TABLE pet_type
(
type_id SERIAL PRIMARY KEY,
type_name VARCHAR NOT NULL
);
CREATE TABLE pet_owner
(
owner_id SERIAL PRIMARY KEY,
owner_name VARCHAR NOT NULL
);
CREATE TABLE pet_record
(
record_id SERIAL PRIMARY KEY,
type_id INT NOT NULL REFERENCES pet_type (type_id),
owner_id INT NOT NULL REFERENCES pet_owner (owner_id),
pet_name VARCHAR NOT NULL
);
INSERT INTO pet_type (type_name)
VALUES ('Cat'), ('Dog'), ('Horse');
INSERT INTO pet_owner (owner_name)
VALUES ('Catherine L. Ady'), ('Peter Venkman'), ('Wilbur Post');
INSERT INTO pet_record (type_id, owner_id, pet_name)
VALUES (1, 1, 'Fluffy'), (1, 1, 'Mr. Whiskers'),
(1, 2, 'Garfield'), (2, 2, 'Spinee'),
(3, 3, 'Mr. Ed');
Now that we have some data, we want to retrieve it in a way that’s relevant and descriptive. I could go on a long and exhausting description regarding JOIN
theory, but those already exist and probably do a better job anyway. Instead, I’ll just say there are two types of JOIN
: INNER
and OUTER
.
INNER JOINs are the default in Postgres. Because it’s the default, we don’t actually need to use the INNER
keyword. Legacy systems don’t always allow this, so some ORMs and end users still use INNER JOIN
explicitly, but it isn’t necessary or even encouraged with Postgres. The INNER
keyword simply means that rows must exist in both tables to appear in the results.
For example, let’s list all pets and their pet type:
SELECT p.pet_name, t.type_name
FROM pet_record p
JOIN pet_type t ON (t.type_id = p.type_id);
pet_name | type_name
--------------+-----------
Fluffy | Cat
Mr. Whiskers | Cat
Garfield | Cat
Spinee | Dog
Mr. Ed | Horse
This is the least complicated type of JOIN
there is. In this particular case, the type_id
column exists in both pet_type
and pet_record
. We can use that information to join the tables together so they resemble a single table. From that pool of columns, we can pick and choose the ones that interest us in the SELECT
section.
Postgres also supports some shorthand syntax that simplifies the process:
SELECT p.pet_name, t.type_name
FROM pet_record p
JOIN pet_type t USING (type_id);
The primary caveat to USING
is that the JOIN
columns aren’t explicitly listed. If there are multiple tables with similarly named columns involved in a JOIN
, Postgres might complain about ambiguity, and then we’d have to revert to the regular syntax. Still, it’s a nice shortcut for most queries.
If that’s an INNER JOIN
, what then is an OUTER JOIN
? This is a bit more difficult to visualize, but the logic behind it is this: if a row from one table does not exist in the other, represent the missing data as NULL
. In other words, since we’re combining two tables to be a larger, virtual table, don’t remove records simply because they’re missing.
As an example, let’s add another pet type and then count how many pets of each type exist. First with a regular JOIN
, and then with an OUTER JOIN
:
INSERT INTO pet_type (type_name) VALUES ('Cow');
-- First with a standard JOIN
SELECT t.type_name, count(p.record_id) AS total
FROM pet_type t
JOIN pet_record p USING (type_id)
GROUP BY t.type_name;
type_name | total
-----------+-------
Cat | 3
Horse | 1
Dog | 1
-- Now use an OUTER JOIN
SELECT t.type_name, count(p.record_id) AS total
FROM pet_type t
LEFT JOIN pet_record p USING (type_id)
GROUP BY t.type_name;
type_name | total
-----------+-------
Cow | 0
Cat | 3
Horse | 1
Dog | 1
-- Here's the "table" the OUTER JOIN made
SELECT t.*, p.*
FROM pet_type t
LEFT JOIN pet_record p USING (type_id);
type_id | type_name | record_id | type_id | owner_id | pet_name
---------+-----------+-----------+---------+----------+--------------
1 | Cat | 2 | 1 | 1 | Fluffy
1 | Cat | 3 | 1 | 1 | Mr. Whiskers
1 | Cat | 4 | 1 | 2 | Garfield
2 | Dog | 5 | 2 | 2 | Spinee
3 | Horse | 6 | 3 | 3 | Mr. Ed
4 | Cow | | | |
The idea here is that nobody owns a cow, but we still want the pet type listed so we can see that nobody owns a cow. Otherwise, we might not even consider that owning a cow is even an option. But we didn’t even use the OUTER
keyword! What’s with this LEFT
nonsense? It just so happens that, like INNER
, the OUTER
keyword is also optional in Postgres. What actually makes a JOIN
an OUTER JOIN
are the LEFT
, RIGHT
, and FULL
keywords.
Looking at the virtual table produced by the JOIN
itself, it’s clear that we only have information supplied by the table that was on the left side of the join. Since this is an OUTER
join type, Postgres substituted a bunch of blank data for the missing row in the pet_record
table on the right, since its data was optional. If we wanted to represent the same query as a RIGHT
join, we just need to move things around a bit:
SELECT t.*, p.*
FROM pet_record p
RIGHT JOIN pet_type t USING (type_id);
Now the records on the right are the source of the JOIN
and those on the left are represented by NULL
if they don’t exist. Otherwise, the output is exactly the same as before. Because of this, it’s fairly uncommon to encounter RIGHT JOIN
in the wild. Western countries use writing systems that read left-to-right, so it’s likely joins will also follow that trend.
In that case, how is FULL
different? Combine the idea of LEFT
and RIGHT
, and you get FULL
. Essentially when joining two tables with FULL
, rows that don’t exist on the right are represented by NULL
, and rows that don’t exist on the left are NULL
as well. This kind of output makes sense when combining reports or other aggregated output, since some summaries won’t have an equivalent on either side, but we still want the data.
In instances like our relational diagram, it’s not possible to observe the result of a FULL JOIN
on the IDs because at least one ID always exists. But… we can fake it by joining on a column they don’t share, and never matches:
SELECT t.*, p.*
FROM pet_record p
FULL JOIN pet_type t ON (p.pet_name = t.type_name);
type_id | type_name | record_id | type_id | owner_id | pet_name
---------+-----------+-----------+---------+----------+--------------
1 | Cat | | | |
4 | Cow | | | |
2 | Dog | | | |
| | 2 | 1 | 1 | Fluffy
| | 4 | 1 | 2 | Garfield
3 | Horse | | | |
| | 6 | 3 | 3 | Mr. Ed
| | 3 | 1 | 1 | Mr. Whiskers
| | 5 | 2 | 2 | Spinee
It’s like we just haphazardly glued the tables together into one giant table. But examining the output itself, the NULL
values always occur when there’s no match for the column(s) we used for the join itself. Thus FULL
is operating exactly as advertised.
OUTER JOIN
is normally used for optional relationships, like in our example where there might be a type of pet nobody owns. It can also be used to find club members that currently own no pets. The point here is that a regular JOIN
would exclude these situations and potentially produce erroneous data in a report. OUTER JOIN
fills a very defined role, and even though it’s not as dominant, it’s indispensable.
As a final note on JOINs, there is one easily made mistake that generally results in something of a disaster. Imagine we wanted to know who owned each pet, and we rush the query with some cut-and-paste magic. This is what we find:
SELECT o.owner_name, p.pet_name, t.type_name
FROM pet_record p
JOIN pet_type t ON (t.type_id = p.type_id)
JOIN pet_owner o ON (t.type_id = p.type_id);
owner_name | pet_name | type_name
------------------+--------------+-----------
Catherine L. Ady | Fluffy | Cat
Catherine L. Ady | Mr. Whiskers | Cat
Catherine L. Ady | Garfield | Cat
Catherine L. Ady | Spinee | Dog
Catherine L. Ady | Mr. Ed | Horse
Peter Venkman | Fluffy | Cat
Peter Venkman | Mr. Whiskers | Cat
Peter Venkman | Garfield | Cat
Peter Venkman | Spinee | Dog
Peter Venkman | Mr. Ed | Horse
Wilbur Post | Fluffy | Cat
Wilbur Post | Mr. Whiskers | Cat
Wilbur Post | Garfield | Cat
Wilbur Post | Spinee | Dog
Wilbur Post | Mr. Ed | Horse
Oh no! There’s no way every pet is owned by every person! What we have here is a Cartesian Product, and is the bane of databases everywhere. In our case, it happened because we joined the pet_owner
table on type_id
instead of owner_id
. Since the join clause can be anything, Postgres dutifully joined every row in the owner table to every row in the record table. This operation is multiplicative, and so the complexity grows exponentially based on the size of the source tables.
We got off lucky because our tables are small example sets. Imagine what would have happened if we were joining two million-row tables! Producing, transmitting, and receiving one trillion rows instead of a million might just cause some problems. Even small results of a couple thousand rows can be amplified tremendously into millions of results due to a malformed JOIN
. However, this example actually provides some valuable insight about what’s going on during the join itself.
Consider the overzealous output again. The rows we want are there, it’s just that the invalid ones weren’t pruned before they were returned to us. There are clearly some efficiency models in place to avoid exhaustive combinations, but this really is how joins work. They combine every row from one table with every row in another table, and weed out the results that don’t match the JOIN
criteria. Because of that, the output can be anything from a boring and expected list of pet owners and their pets, or a nonsensical combination of everything. It’s all in the clauses.
Here’s what would have happened if we didn’t suffer from cut-and-paste-itis:
SELECT o.owner_name, p.pet_name, t.type_name
FROM pet_record p
JOIN pet_type t ON (t.type_id = p.type_id)
JOIN pet_owner o ON (o.owner_id = p.owner_id);
owner_name | pet_name | type_name
------------------+--------------+-----------
Catherine L. Ady | Fluffy | Cat
Catherine L. Ady | Mr. Whiskers | Cat
Peter Venkman | Garfield | Cat
Peter Venkman | Spinee | Dog
Wilbur Post | Mr. Ed | Horse
Much better. This is one of the main reasons databases like Postgres are so versatile. Not only will they give you exactly what you ask for, but the diversity of the results covers much of what is possible through Set Theory. The mathematics of Set Theory are justifiably impenetrable to someone without a degree in Mathematics—and even some that do—yet the power lurking there is undeniable. For everyone else, the basics are more than sufficient.
That kind of adaptability is a double-edged sword to be sure, but knowing the pitfalls is the best way to avoid them. Postgres is one of those things that’s easy to use and hard to master precisely because of the Math it represents. Now that you have some of the tools necessary to use Postgres, feel free to try and master it.
I still haven’t, but that doesn’t mean I won’t stop trying.