PG Phriday: Elephantary, My Dear

Friday August 05, 2016

Page content

Occasionally with a lot of similar-sounding terminology, there’s ample room for misunderstandings. This is especially relevant with overloaded terms like ‘index’, which can be anything from a data lookup to a term used in mutual funds. This is further complicated if a developer’s first experience with databases is with another variant with sufficiently divergent syntax, methodology, or assumptions. To prevent future code refactors born of misunderstandings, let’s build a basic Postgres glossary to act as an executive summary for app devs.

Let’s consider this a continuation of our exploration of databases and schemas. We need a solid glossary of terms centered around frequently conflated concepts.

Tables

At its core, a Postgres table is just a container for columns of various types. Each record in a table is stored as a row of one or more columns. We can see that easily enough here:

CREATE TABLE animal_sound
(
  animal  VARCHAR,
  sound   VARCHAR
);

INSERT INTO animal_sound (animal, sound) VALUES ('Cow', 'Moo');
INSERT INTO animal_sound (animal, sound) VALUES ('Cat', 'Meow');

SELECT * FROM animal_sound;

 animal | sound 
--------+-------
 Cow    | Moo
 Cat    | Meow

Nothing groundbreaking here, right? Even the most rank database newbie should know this much. It’s when we start adding elements on top that things start to go sideways. Let’s take this one step at a time to prevent that from happening.

Sequences

The next thing a developer will probably want is an auto-incrementing field of some kind. MySQL does this with their AUTO_INCREMENT decorator. MS SQL uses a function called IDENTITY that needs parameters to determine operation. Oracle requires something called a SEQUENCE that must either be used explicitly or tied to the table with a TRIGGER. Postgres uses sequences too, but they tend to be hidden behind the SERIAL data type.

DROP TABLE animal_sound;

CREATE TABLE animal_sound
(
  id      SERIAL,
  animal  VARCHAR,
  sound   VARCHAR
);

INSERT INTO animal_sound (animal, sound) VALUES ('Cow', 'Moo');
INSERT INTO animal_sound (animal, sound) VALUES ('Cat', 'Meow');

SELECT * FROM animal_sound;

 id | animal | sound 
----+--------+-------
  1 | Cow    | Moo
  2 | Cat    | Meow

\d animal_sound

 Column |       Type        |                   Modifiers                   
--------+-------------------+------------------------------------------------
 id     | integer           | not null default nextval('animal_sound_id_seq')
 animal | character varying | 
 sound  | character varying |

Behind the scenes, Postgres is transforming the SERIAL type into a SEQUENCE with a DEFAULT value for the column. This makes the Postgres approach something of a middle-ground between the strict Oracle approach, and the much looser MySQL decorator. We saw this explicitly by asking Postgres to tell us the structure of the table.

If we truly desired to use a manual approach, it would have looked like this:

DROP TABLE animal_sound;

CREATE SEQUENCE animal_sound_id_seq;

CREATE TABLE animal_sound
(
  id      INT DEFAULT nextval('animal_sound_id_seq'),
  animal  VARCHAR,
  sound   VARCHAR
);

ALTER SEQUENCE animal_sound_id_seq OWNED BY animal_sound.id;

So that SERIAL type really saved us quite a bit of work. We didn’t need to explicitly create the sequence, and we didn’t need to marry it to the table column that uses it. That last ALTER SEQUENCE statement is necessary so Postgres knows the sequence is associated with the table. This way, if we drop the table, or export it into a dump file, the sequence (and its most recent value!) is included. Basically, It becomes integral to the table’s operation.

Since a sequence is a separate database object, it has its own attributes. Beyond just the starting value and increment size, we can tweak wraparound, caching, and a couple of other things. There really is very little magic in a Postgres database, and anything that seems magical is probably just a wrapper for something we could have done by hand. This makes for a very powerful hybrid approach that covers both the common and advanced use cases.

Keep in mind that sequences being a separate entity from the table means the value is not affected by underlying operations on the table or its data. What happens to a sequence if we delete all data from a table it’s associated with?

DELETE FROM animal_sound;

SELECT last_value FROM animal_sound_id_seq;

 last_value 
------------
          3

So far as the sequence is concerned, the next ‘id’ value will be 3 unless it’s explicitly set to a different value. If this were MySQL or SQL Server, we’d start back at 1. This may seem like a limitation, but it means sequences can be used for things other than incrementing data for a single table column. One sequence can be shared by many tables, used to generate other values, or just act as a seed for some other process. These are all uncommon or otherwise advanced techniques, but they remain possible due to sequences remaining distinct entities.

Indexes

At its most basic, an index is just a functional calculation of some kind, performed on the value of one or more table columns. The result is then processed into some kind of storage structure that is optimized for instant (or at least very fast) lookup. The Postgres default is to use a B-tree, but there are numerous other options.

Let’s add an index to the id column of our fake table:

CREATE INDEX idx_animal_sound_id ON animal_sound (id);

\d animal_sound

 Column |       Type        |                   Modifiers                   
--------+-------------------+------------------------------------------------
 id     | integer           | not null default nextval('animal_sound_id_seq')
 animal | character varying | 
 sound  | character varying | 
Indexes:
    "idx_animal_sound_id" btree (id)

Just like other database-level objects, indexes exist distinctly and separately from every other object in the database. They utilize storage, and incur overhead. After all, new or updated data necessarily invokes the function that maintains indexed versions of database columns. This is why we don’t simply index every column in a table.

Due to this, there are only two cases where Postgres will automatically create an index: for primary keys and unique constraints. We’ll discuss these special cases next. For now though, consider that indexes exist separately as objects we can manipulate independently as with other base-level objects. They do not replace the column data, nor even represent it. Indexes are just a lookup structure bolted onto the table, with implicit routines to maintain themselves for fast lookup.

Indexes come into use when dereferencing the lookup is faster than just reading the entire table. As such, our example table is far too small to see an index in action. But we can get around that:

CREATE TABLE huge_blob AS
SELECT id FROM generate_series(1, 1000) id;

CREATE INDEX idx_huge_blob_id ON huge_blob (id);

EXPLAIN
SELECT *
  FROM huge_blob
 WHERE id = 17;

                    QUERY PLAN 
-----------------------------------------------------
 Index Only Scan using idx_huge_blob_id on huge_blob
       (cost=0.28..8.29 rows=1 width=4)
   Index Cond: (id = 17)

See that? Instead of reading the entire table, Postgres calculated the hash for the lookup value of 17, and found that position in the B-tree to obtain the actual location of the row. There’s a longer discussion to have regarding good index candidates, optimal index types, index creation parameters and other minutiae, but we’ll save that for a future article.

Constraints

Constraints are rules we want to apply to our table columns. Maybe they should only represent specific values, or reject certain combinations. The most common however, are PRIMARY KEY and UNIQUE constraints. And why not? A primary key effectively represents the main lookup value for a row in our table, and the need for unique combinations is easily understood.

Let’s start with primary keys, since it’s very rare to encounter a table without one. Let’s modify our sample table a bit and show one in action:

DROP TABLE animal_sound;

CREATE TABLE animal_sound
(
  id      SERIAL,
  animal  VARCHAR,
  color   VARCHAR,
  sound   VARCHAR
);

ALTER TABLE animal_sound
  ADD CONSTRAINT animal_sound_pkey
      PRIMARY KEY (id);

INSERT INTO animal_sound (id, animal, color, sound)
       VALUES (1, 'Cow', 'Brown', 'Moo');
INSERT INTO animal_sound (id, animal, color, sound)
       VALUES (1, 'Cow', 'Spotted', 'Moo?');

ERROR:  duplicate key value violates unique
        constraint "animal_sound_pkey"

\d animal_sound

 Column |       Type        |                   Modifiers                   
--------+-------------------+------------------------------------------------
 id     | integer           | not null default nextval('animal_sound_id_seq')
 animal | character varying | 
 color  | character varying | 
 sound  | character varying | 
Indexes:
    "animal_sound_pkey" PRIMARY KEY, btree (id)

We can see here that the primary key prevented a duplicate value in the id column of our table. Adding the primary key also gave us a “free” index on the column as well. This should be expected, since we need to check candidates to prevent utilizing existing id values. Without this constraint, both insert statements would have succeeded, and we’d have no easy way to differentiate the data at a later date without using every column.

With the primary key in place, we can always rely on the id column being unique, and acting as the direct representative of each table row. Do not confuse a primary key with an index however! A primary key is a constraint that uses an index.

This is a statement that also applies to unique constraints. Let’s add one and test it out:

DROP TABLE animal_sound;

CREATE TABLE animal_sound
(
  id      SERIAL PRIMARY KEY,
  animal  VARCHAR,
  color   VARCHAR,
  sound   VARCHAR
);

ALTER TABLE animal_sound
  ADD CONSTRAINT udx_animal_sound_by_animal_color
      UNIQUE (animal, color);

INSERT INTO animal_sound (animal, color, sound)
       VALUES ('Cow', 'Brown', 'Moo');
INSERT INTO animal_sound (animal, color, sound)
       VALUES ('Cow', 'Spotted', 'Moo?');
INSERT INTO animal_sound (animal, color, sound)
       VALUES ('Cow', 'Brown', 'MooOOOoOo');

ERROR:  duplicate key value violates unique
        constraint "udx_animal_sound_by_animal_color"

\d animal_sound

 Column |       Type        |                   Modifiers                   
--------+-------------------+------------------------------------------------
 id     | integer           | not null default nextval('animal_sound_id_seq')
 animal | character varying | 
 color  | character varying | 
 sound  | character varying | 
Indexes:
    "animal_sound_pkey" PRIMARY KEY, btree (id)
    "udx_animal_sound_by_animal_color" UNIQUE CONSTRAINT, btree (animal, color)

There are a couple new pieces here. First, notice that we took advantage of some shorthand to create the primary key this time around. By adding the PRIMARY KEY decorator after a column definition, Postgres will automatically add the primary key constraint and choose a name for us following a fairly simple naming scheme. It’s still the same constraint, but like SERIAL, this method saves us a lot of work.

Next, we can see that our unique constraint automatically created an index on our behalf. This works the same way as the primary key; Postgres prevents non-unique values by checking new rows against the existing record of unique combinations. In our case, we’re allowed to have a brown or spotted cow, but not two separate brown cows. Again, a unique constraint is not an index, but uses a unique index to enforce the constraint.

There’s a lot to learn about constraints, but for now, they’re beyond the scope of this article. We’ll probably cover them in the future, of course!

Summary

In the end, we only really need to remember that all of the base elements discussed here are individually distinct. A column is not an index. A sequence is not a primary key. An index is not a constraint. A primary key is not an index. But a column can be indexed. Sequences can provide default values to a column. Constraints can utilize an index to enforce themselves.

Tables, sequences, and indexes can be tied together with syntactical glue, but exist independently. Don’t get these things mixed up because other database engines use them differently or deviate conceptually. While invoking the handy SERIAL and PRIMARY KEY shorthand, keep in mind what’s really going on under the hood. There may be occasions where it matters, or we need to leverage the distinction in our favor for advanced use cases.

At least Postgres gives us that option.

Tables

Sequences

Indexes

Constraints

Summary

See Also