PG Phriday: Elephantary, My Dear
Occasionally with a lot of similar-sounding terminology, there’s ample room for misunderstandings. This is especially relevant with overloaded terms like ‘index’, which can be anything from a data lookup to a term used in mutual funds. This is further complicated if a developer’s first experience with databases is with another variant with sufficiently divergent syntax, methodology, or assumptions. To prevent future code refactors born of misunderstandings, let’s build a basic Postgres glossary to act as an executive summary for app devs.
Let’s consider this a continuation of our exploration of databases and schemas. We need a solid glossary of terms centered around frequently conflated concepts.
Tables
At its core, a Postgres table is just a container for columns of various types. Each record in a table is stored as a row of one or more columns. We can see that easily enough here:
CREATE TABLE animal_sound
(
animal VARCHAR,
sound VARCHAR
);
INSERT INTO animal_sound (animal, sound) VALUES ('Cow', 'Moo');
INSERT INTO animal_sound (animal, sound) VALUES ('Cat', 'Meow');
SELECT * FROM animal_sound;
animal | sound
--------+-------
Cow | Moo
Cat | Meow
Nothing groundbreaking here, right? Even the most rank database newbie should know this much. It’s when we start adding elements on top that things start to go sideways. Let’s take this one step at a time to prevent that from happening.
Sequences
The next thing a developer will probably want is an auto-incrementing field of some kind. MySQL does this with their AUTO_INCREMENT
decorator. MS SQL uses a function called IDENTITY
that needs parameters to determine operation. Oracle requires something called a SEQUENCE
that must either be used explicitly or tied to the table with a TRIGGER
. Postgres uses sequences too, but they tend to be hidden behind the SERIAL
data type.
DROP TABLE animal_sound;
CREATE TABLE animal_sound
(
id SERIAL,
animal VARCHAR,
sound VARCHAR
);
INSERT INTO animal_sound (animal, sound) VALUES ('Cow', 'Moo');
INSERT INTO animal_sound (animal, sound) VALUES ('Cat', 'Meow');
SELECT * FROM animal_sound;
id | animal | sound
----+--------+-------
1 | Cow | Moo
2 | Cat | Meow
\d animal_sound
Column | Type | Modifiers
--------+-------------------+------------------------------------------------
id | integer | not null default nextval('animal_sound_id_seq')
animal | character varying |
sound | character varying |
Behind the scenes, Postgres is transforming the SERIAL type into a SEQUENCE with a DEFAULT value for the column. This makes the Postgres approach something of a middle-ground between the strict Oracle approach, and the much looser MySQL decorator. We saw this explicitly by asking Postgres to tell us the structure of the table.
If we truly desired to use a manual approach, it would have looked like this:
DROP TABLE animal_sound;
CREATE SEQUENCE animal_sound_id_seq;
CREATE TABLE animal_sound
(
id INT DEFAULT nextval('animal_sound_id_seq'),
animal VARCHAR,
sound VARCHAR
);
ALTER SEQUENCE animal_sound_id_seq OWNED BY animal_sound.id;
So that SERIAL
type really saved us quite a bit of work. We didn’t need to explicitly create the sequence, and we didn’t need to marry it to the table column that uses it. That last ALTER SEQUENCE
statement is necessary so Postgres knows the sequence is associated with the table. This way, if we drop the table, or export it into a dump file, the sequence (and its most recent value!) is included. Basically, It becomes integral to the table’s operation.
Since a sequence is a separate database object, it has its own attributes. Beyond just the starting value and increment size, we can tweak wraparound, caching, and a couple of other things. There really is very little magic in a Postgres database, and anything that seems magical is probably just a wrapper for something we could have done by hand. This makes for a very powerful hybrid approach that covers both the common and advanced use cases.
Keep in mind that sequences being a separate entity from the table means the value is not affected by underlying operations on the table or its data. What happens to a sequence if we delete all data from a table it’s associated with?
DELETE FROM animal_sound;
SELECT last_value FROM animal_sound_id_seq;
last_value
------------
3
So far as the sequence is concerned, the next ‘id’ value will be 3 unless it’s explicitly set to a different value. If this were MySQL or SQL Server, we’d start back at 1. This may seem like a limitation, but it means sequences can be used for things other than incrementing data for a single table column. One sequence can be shared by many tables, used to generate other values, or just act as a seed for some other process. These are all uncommon or otherwise advanced techniques, but they remain possible due to sequences remaining distinct entities.
Indexes
At its most basic, an index is just a functional calculation of some kind, performed on the value of one or more table columns. The result is then processed into some kind of storage structure that is optimized for instant (or at least very fast) lookup. The Postgres default is to use a B-tree, but there are numerous other options.
Let’s add an index to the id
column of our fake table:
CREATE INDEX idx_animal_sound_id ON animal_sound (id);
\d animal_sound
Column | Type | Modifiers
--------+-------------------+------------------------------------------------
id | integer | not null default nextval('animal_sound_id_seq')
animal | character varying |
sound | character varying |
Indexes:
"idx_animal_sound_id" btree (id)
Just like other database-level objects, indexes exist distinctly and separately from every other object in the database. They utilize storage, and incur overhead. After all, new or updated data necessarily invokes the function that maintains indexed versions of database columns. This is why we don’t simply index every column in a table.
Due to this, there are only two cases where Postgres will automatically create an index: for primary keys and unique constraints. We’ll discuss these special cases next. For now though, consider that indexes exist separately as objects we can manipulate independently as with other base-level objects. They do not replace the column data, nor even represent it. Indexes are just a lookup structure bolted onto the table, with implicit routines to maintain themselves for fast lookup.
Indexes come into use when dereferencing the lookup is faster than just reading the entire table. As such, our example table is far too small to see an index in action. But we can get around that:
CREATE TABLE huge_blob AS
SELECT id FROM generate_series(1, 1000) id;
CREATE INDEX idx_huge_blob_id ON huge_blob (id);
EXPLAIN
SELECT *
FROM huge_blob
WHERE id = 17;
QUERY PLAN
-----------------------------------------------------
Index Only Scan using idx_huge_blob_id on huge_blob
(cost=0.28..8.29 rows=1 width=4)
Index Cond: (id = 17)
See that? Instead of reading the entire table, Postgres calculated the hash for the lookup value of 17, and found that position in the B-tree to obtain the actual location of the row. There’s a longer discussion to have regarding good index candidates, optimal index types, index creation parameters and other minutiae, but we’ll save that for a future article.
Constraints
Constraints are rules we want to apply to our table columns. Maybe they should only represent specific values, or reject certain combinations. The most common however, are PRIMARY KEY
and UNIQUE
constraints. And why not? A primary key effectively represents the main lookup value for a row in our table, and the need for unique combinations is easily understood.
Let’s start with primary keys, since it’s very rare to encounter a table without one. Let’s modify our sample table a bit and show one in action:
DROP TABLE animal_sound;
CREATE TABLE animal_sound
(
id SERIAL,
animal VARCHAR,
color VARCHAR,
sound VARCHAR
);
ALTER TABLE animal_sound
ADD CONSTRAINT animal_sound_pkey
PRIMARY KEY (id);
INSERT INTO animal_sound (id, animal, color, sound)
VALUES (1, 'Cow', 'Brown', 'Moo');
INSERT INTO animal_sound (id, animal, color, sound)
VALUES (1, 'Cow', 'Spotted', 'Moo?');
ERROR: duplicate key value violates unique
constraint "animal_sound_pkey"
\d animal_sound
Column | Type | Modifiers
--------+-------------------+------------------------------------------------
id | integer | not null default nextval('animal_sound_id_seq')
animal | character varying |
color | character varying |
sound | character varying |
Indexes:
"animal_sound_pkey" PRIMARY KEY, btree (id)
We can see here that the primary key prevented a duplicate value in the id
column of our table. Adding the primary key also gave us a “free” index on the column as well. This should be expected, since we need to check candidates to prevent utilizing existing id values. Without this constraint, both insert statements would have succeeded, and we’d have no easy way to differentiate the data at a later date without using every column.
With the primary key in place, we can always rely on the id
column being unique, and acting as the direct representative of each table row. Do not confuse a primary key with an index however! A primary key is a constraint that uses an index.
This is a statement that also applies to unique constraints. Let’s add one and test it out:
DROP TABLE animal_sound;
CREATE TABLE animal_sound
(
id SERIAL PRIMARY KEY,
animal VARCHAR,
color VARCHAR,
sound VARCHAR
);
ALTER TABLE animal_sound
ADD CONSTRAINT udx_animal_sound_by_animal_color
UNIQUE (animal, color);
INSERT INTO animal_sound (animal, color, sound)
VALUES ('Cow', 'Brown', 'Moo');
INSERT INTO animal_sound (animal, color, sound)
VALUES ('Cow', 'Spotted', 'Moo?');
INSERT INTO animal_sound (animal, color, sound)
VALUES ('Cow', 'Brown', 'MooOOOoOo');
ERROR: duplicate key value violates unique
constraint "udx_animal_sound_by_animal_color"
\d animal_sound
Column | Type | Modifiers
--------+-------------------+------------------------------------------------
id | integer | not null default nextval('animal_sound_id_seq')
animal | character varying |
color | character varying |
sound | character varying |
Indexes:
"animal_sound_pkey" PRIMARY KEY, btree (id)
"udx_animal_sound_by_animal_color" UNIQUE CONSTRAINT, btree (animal, color)
There are a couple new pieces here. First, notice that we took advantage of some shorthand to create the primary key this time around. By adding the PRIMARY KEY
decorator after a column definition, Postgres will automatically add the primary key constraint and choose a name for us following a fairly simple naming scheme. It’s still the same constraint, but like SERIAL
, this method saves us a lot of work.
Next, we can see that our unique constraint automatically created an index on our behalf. This works the same way as the primary key; Postgres prevents non-unique values by checking new rows against the existing record of unique combinations. In our case, we’re allowed to have a brown or spotted cow, but not two separate brown cows. Again, a unique constraint is not an index, but uses a unique index to enforce the constraint.
There’s a lot to learn about constraints, but for now, they’re beyond the scope of this article. We’ll probably cover them in the future, of course!
Summary
In the end, we only really need to remember that all of the base elements discussed here are individually distinct. A column is not an index. A sequence is not a primary key. An index is not a constraint. A primary key is not an index. But a column can be indexed. Sequences can provide default values to a column. Constraints can utilize an index to enforce themselves.
Tables, sequences, and indexes can be tied together with syntactical glue, but exist independently. Don’t get these things mixed up because other database engines use them differently or deviate conceptually. While invoking the handy SERIAL
and PRIMARY KEY
shorthand, keep in mind what’s really going on under the hood. There may be occasions where it matters, or we need to leverage the distinction in our favor for advanced use cases.
At least Postgres gives us that option.