I’ve been going on and on about esoteric Postgres features for so long, sometimes I forget my own neophyte beginnings. So let’s back way off and talk about how JOINs work. What are they? What do they do? Why would you use one? What can go wrong? It’s easier than you might imagine, but there are definitely some subtleties to account for.
To make this more relatable, let’s start with a club of pet owners that likes to keep track of everything:
CREATE TABLE pet_type ( type_id SERIAL PRIMARY KEY, type_name VARCHAR NOT NULL ); CREATE TABLE pet_owner ( owner_id SERIAL PRIMARY KEY, owner_name VARCHAR NOT NULL ); CREATE TABLE pet_record ( record_id SERIAL PRIMARY KEY, type_id INT NOT NULL REFERENCES pet_type (type_id), owner_id INT NOT NULL REFERENCES pet_owner (owner_id), pet_name VARCHAR NOT NULL ); INSERT INTO pet_type (type_name) VALUES ('Cat'), ('Dog'), ('Horse'); INSERT INTO pet_owner (owner_name) VALUES ('Catherine L. Ady'), ('Peter Venkman'), ('Wilbur Post'); INSERT INTO pet_record (type_id, owner_id, pet_name) VALUES (1, 1, 'Fluffy'), (1, 1, 'Mr. Whiskers'), (1, 2, 'Garfield'), (2, 2, 'Spinee'), (3, 3, 'Mr. Ed');
Now that we have some data, we want to retrieve it in a way that’s relevant and descriptive. I could go on a long and exhausting description regarding
JOIN theory, but those already exist and probably do a better job anyway. Instead, I’ll just say there are two types of
INNER JOINs are the default in Postgres. Because it’s the default, we don’t actually need to use the
INNER keyword. Legacy systems don’t always allow this, so some ORMs and end users still use
INNER JOIN explicitly, but it isn’t necessary or even encouraged with Postgres. The
INNER keyword simply means that rows must exist in both tables to appear in the results.
For example, let’s list all pets and their pet type:
SELECT p.pet_name, t.type_name FROM pet_record p JOIN pet_type t ON (t.type_id = p.type_id); pet_name | type_name --------------+----------- Fluffy | Cat Mr. Whiskers | Cat Garfield | Cat Spinee | Dog Mr. Ed | Horse
This is the least complicated type of
JOIN there is. In this particular case, the
type_id column exists in both
pet_record. We can use that information to join the tables together so they resemble a single table. From that pool of columns, we can pick and choose the ones that interest us in the
Postgres also supports some shorthand syntax that simplifies the process:
SELECT p.pet_name, t.type_name FROM pet_record p JOIN pet_type t USING (type_id);
The primary caveat to
USING is that the
JOIN columns aren’t explicitly listed. If there are multiple tables with similarly named columns involved in a
JOIN, Postgres might complain about ambiguity, and then we’d have to revert to the regular syntax. Still, it’s a nice shortcut for most queries.
If that’s an
INNER JOIN, what then is an
OUTER JOIN? This is a bit more difficult to visualize, but the logic behind it is this: if a row from one table does not exist in the other, represent the missing data as
NULL. In other words, since we’re combining two tables to be a larger, virtual table, don’t remove records simply because they’re missing.
As an example, let’s add another pet type and then count how many pets of each type exist. First with a regular
JOIN, and then with an
INSERT INTO pet_type (type_name) VALUES ('Cow'); -- First with a standard JOIN SELECT t.type_name, count(p.record_id) AS total FROM pet_type t JOIN pet_record p USING (type_id) GROUP BY t.type_name; type_name | total -----------+------- Cat | 3 Horse | 1 Dog | 1 -- Now use an OUTER JOIN SELECT t.type_name, count(p.record_id) AS total FROM pet_type t LEFT JOIN pet_record p USING (type_id) GROUP BY t.type_name; type_name | total -----------+------- Cow | 0 Cat | 3 Horse | 1 Dog | 1 -- Here's the "table" the OUTER JOIN made SELECT t.*, p.* FROM pet_type t LEFT JOIN pet_record p USING (type_id); type_id | type_name | record_id | type_id | owner_id | pet_name ---------+-----------+-----------+---------+----------+-------------- 1 | Cat | 2 | 1 | 1 | Fluffy 1 | Cat | 3 | 1 | 1 | Mr. Whiskers 1 | Cat | 4 | 1 | 2 | Garfield 2 | Dog | 5 | 2 | 2 | Spinee 3 | Horse | 6 | 3 | 3 | Mr. Ed 4 | Cow | | | |
The idea here is that nobody owns a cow, but we still want the pet type listed so we can see that nobody owns a cow. Otherwise, we might not even consider that owning a cow is even an option. But we didn’t even use the
OUTER keyword! What’s with this
LEFT nonsense? It just so happens that, like
OUTER keyword is also optional in Postgres. What actually makes a
OUTER JOIN are the
Looking at the virtual table produced by the
JOIN itself, it’s clear that we only have information supplied by the table that was on the left side of the join. Since this is an
OUTER join type, Postgres substituted a bunch of blank data for the missing row in the
pet_record table on the right, since its data was optional. If we wanted to represent the same query as a
RIGHT join, we just need to move things around a bit:
SELECT t.*, p.* FROM pet_record p RIGHT JOIN pet_type t USING (type_id);
Now the records on the right are the source of the
JOIN and those on the left are represented by
NULL if they don’t exist. Otherwise, the output is exactly the same as before. Because of this, it’s fairly uncommon to encounter
RIGHT JOIN in the wild. Western countries use writing systems that read left-to-right, so it’s likely joins will also follow that trend.
In that case, how is
FULL different? Combine the idea of
RIGHT, and you get
FULL. Essentially when joining two tables with
FULL, rows that don’t exist on the right are represented by
NULL, and rows that don’t exist on the left are
NULL as well. This kind of output makes sense when combining reports or other aggregated output, since some summaries won’t have an equivalent on either side, but we still want the data.
In instances like our relational diagram, it’s not possible to observe the result of a
FULL JOIN on the IDs because at least one ID always exists. But… we can fake it by joining on a column they don’t share, and never matches:
SELECT t.*, p.* FROM pet_record p FULL JOIN pet_type t ON (p.pet_name = t.type_name); type_id | type_name | record_id | type_id | owner_id | pet_name ---------+-----------+-----------+---------+----------+-------------- 1 | Cat | | | | 4 | Cow | | | | 2 | Dog | | | | | | 2 | 1 | 1 | Fluffy | | 4 | 1 | 2 | Garfield 3 | Horse | | | | | | 6 | 3 | 3 | Mr. Ed | | 3 | 1 | 1 | Mr. Whiskers | | 5 | 2 | 2 | Spinee
It’s like we just haphazardly glued the tables together into one giant table. But examining the output itself, the
NULL values always occur when there’s no match for the column(s) we used for the join itself. Thus
FULL is operating exactly as advertised.
OUTER JOIN is normally used for optional relationships, like in our example where there might be a type of pet nobody owns. It can also be used to find club members that currently own no pets. The point here is that a regular
JOIN would exclude these situations and potentially produce erroneous data in a report.
OUTER JOIN fills a very defined role, and even though it’s not as dominant, it’s indispensable.
As a final note on JOINs, there is one easily made mistake that generally results in something of a disaster. Imagine we wanted to know who owned each pet, and we rush the query with some cut-and-paste magic. This is what we find:
SELECT o.owner_name, p.pet_name, t.type_name FROM pet_record p JOIN pet_type t ON (t.type_id = p.type_id) JOIN pet_owner o ON (t.type_id = p.type_id); owner_name | pet_name | type_name ------------------+--------------+----------- Catherine L. Ady | Fluffy | Cat Catherine L. Ady | Mr. Whiskers | Cat Catherine L. Ady | Garfield | Cat Catherine L. Ady | Spinee | Dog Catherine L. Ady | Mr. Ed | Horse Peter Venkman | Fluffy | Cat Peter Venkman | Mr. Whiskers | Cat Peter Venkman | Garfield | Cat Peter Venkman | Spinee | Dog Peter Venkman | Mr. Ed | Horse Wilbur Post | Fluffy | Cat Wilbur Post | Mr. Whiskers | Cat Wilbur Post | Garfield | Cat Wilbur Post | Spinee | Dog Wilbur Post | Mr. Ed | Horse
Oh no! There’s no way every pet is owned by every person! What we have here is a Cartesian Product, and is the bane of databases everywhere. In our case, it happened because we joined the
pet_owner table on
type_id instead of
owner_id. Since the join clause can be anything, Postgres dutifully joined every row in the owner table to every row in the record table. This operation is multiplicative, and so the complexity grows exponentially based on the size of the source tables.
We got off lucky because our tables are small example sets. Imagine what would have happened if we were joining two million-row tables! Producing, transmitting, and receiving one trillion rows instead of a million might just cause some problems. Even small results of a couple thousand rows can be amplified tremendously into millions of results due to a malformed
JOIN. However, this example actually provides some valuable insight about what’s going on during the join itself.
Consider the overzealous output again. The rows we want are there, it’s just that the invalid ones weren’t pruned before they were returned to us. There are clearly some efficiency models in place to avoid exhaustive combinations, but this really is how joins work. They combine every row from one table with every row in another table, and weed out the results that don’t match the
JOIN criteria. Because of that, the output can be anything from a boring and expected list of pet owners and their pets, or a nonsensical combination of everything. It’s all in the clauses.
Here’s what would have happened if we didn’t suffer from cut-and-paste-itis:
SELECT o.owner_name, p.pet_name, t.type_name FROM pet_record p JOIN pet_type t ON (t.type_id = p.type_id) JOIN pet_owner o ON (o.owner_id = p.owner_id); owner_name | pet_name | type_name ------------------+--------------+----------- Catherine L. Ady | Fluffy | Cat Catherine L. Ady | Mr. Whiskers | Cat Peter Venkman | Garfield | Cat Peter Venkman | Spinee | Dog Wilbur Post | Mr. Ed | Horse
Much better. This is one of the main reasons databases like Postgres are so versatile. Not only will they give you exactly what you ask for, but the diversity of the results covers much of what is possible through Set Theory. The mathematics of Set Theory are justifiably impenetrable to someone without a degree in Mathematics—and even some that do—yet the power lurking there is undeniable. For everyone else, the basics are more than sufficient.
That kind of adaptability is a double-edged sword to be sure, but knowing the pitfalls is the best way to avoid them. Postgres is one of those things that’s easy to use and hard to master precisely because of the Math it represents. Now that you have some of the tools necessary to use Postgres, feel free to try and master it.
I still haven’t, but that doesn’t mean I won’t stop trying.