PG Phriday: Inevitable Interdiction
“Hey! That row shouldn’t be in that table! How the heck did that get there!? Alright, who wrote the application client filters, because you’re fired!”
Good application developers know never to trust client input, but not all realize that a single app is rarely the only vector into a database. Databases don’t just preserve data with various levels of paranoia, they’re also the central nexus of a constellation of apps, scripts, APIs, GUIs, BMIs, HMOs, and STDs. As such, unless every single one of those share a common ancestor that sanitizes, boils, renders, and formats content before storage, there’s bound to be inconsistencies. That’s just how things work.
One of the major benefits of using an RDBMS, is that engines like Postgres provide several mechanisms for ensuring data integrity beyond crash durability and transaction control. Continuing our discussion on database objects from last week, we vaguely referred to other types of constraint. So, what other mechanisms are available aside from primary keys and unique constraints?
Let’s start with foreign keys, since they illustrate how tables are related and enforce that relation to reject invalid content. Here are two simply related tables and a couple of rows:
CREATE TABLE pet_type
(
type_id SERIAL PRIMARY KEY,
animal VARCHAR UNIQUE NOT NULL
);
CREATE TABLE pet
(
pet_id SERIAL PRIMARY KEY,
type_id INT NOT NULL,
pet_name VARCHAR NOT NULL,
owner_name VARCHAR NOT NULL
);
ALTER TABLE pet
ADD CONSTRAINT fk_pet_pet_type FOREIGN KEY (type_id)
REFERENCES pet_type (type_id);
INSERT INTO pet_type (animal) VALUES ('cat'), ('dog');
INSERT INTO pet
(type_id, pet_name, owner_name)
VALUES
(1, 'Meow Meow Fuzzyface', 'Cedric'),
(2, 'Mr. Peanutbutter', 'Paul');
Foreign keys provide a buffer between pending modifications by enforcing the relationship. In this case, we can’t remove “dog” as a pet type because at least one pet references it. We also can’t insert a pet fish, because there’s no corresponding type.
The other job the foreign key fills is to normalize the type names. If we had used a VARCHAR
column in the pet table instead, we could have types of “Dog”, “dog”, “GDo”, or any number of typos preserved for eternity and forever complicating searches. This helps illustrate and sanitize the relationship and all affected data. Let’s see it in action:
DELETE FROM pet_type WHERE animal = 'dog';
ERROR: update or delete on table "pet_type" violates foreign key
constraint "fk_pet_pet_type" on table "pet"
INSERT INTO pet
(type_id, pet_name, owner_name)
VALUES
(3, 'Wanda', 'Lisa');
ERROR: insert or update on table "pet" violates foreign key
constraint "fk_pet_pet_type"
Foreign keys do have other modes of operation. We could for instance, declare the constraint as ON DELETE CASCADE
. That would enable us to delete from the parent table, but a delete cascade would mean every row in any table that referenced the deleted value would also be removed. That’s a fairly dangerous operation, and a pretty good reason it’s not the default in Postgres.
Foreign keys are fairly limited in application, however. Beyond describing a relationship and enforcing its integrity, there isn’t much left. If we wanted to impose actual rules on the data itself, we need to go a bit further into the toolbox.
For example, suppose we want to ensure pet birth dates are firmly established in the past. None of the previously mentioned constraints will let us apply arbitrary rules on column values, right? That’s where CHECK
constraints come in!
ALTER TABLE pet ADD birth_date DATE;
ALTER TABLE pet
ADD CONSTRAINT ck_pet_no_future
CHECK (CURRENT_DATE - birth_date > 0);
INSERT INTO pet
(type_id, pet_name, owner_name, birth_date)
VALUES
(1, 'Princess Carolyn', 'Amy', '2017-08-12');
ERROR: new row for relation "pet" violates check
constraint "ck_pet_no_future"
Not bad, eh? Check constraints are exceptionally useful when there are very simple rules we want to enforce. Maybe prices should always be positive. Perhaps invoice line items should be positive unless they’re a credit. There is a lot of potential here, but there’s also room for abuse. There’s technically no limit on the amount of conditionals a check constraint enforces, or the number of checks we prescribe, so we must be judicious or risk performance degradation.
Still, preventing critical data flaws prior to insert is a stupendous capability. Can we go further, though? Of course we can! The final constraint type is for data exclusion. Imagine in our examples that pets can change owners, but can’t be owned by two people simultaneously. Well, we can’t use check constraints for that since they only operate on the current row, and a unique constraint won’t work either.
Let’s watch EXCLUDE
handle the situation with ease:
CREATE EXTENSION btree_gist;
ALTER TABLE pet ADD owner_range TSRANGE;
ALTER TABLE pet
ADD CONSTRAINT ex_owner_overlap
EXCLUDE USING GIST (
pet_name WITH =, owner_range WITH &&
);
INSERT INTO pet_type (animal) VALUES ('horse');
INSERT INTO pet
(type_id, pet_name, owner_name, birth_date, owner_range)
VALUES
(3, 'Bojack', 'Will', '1964-06-12', '[1964-06-12,2014-09-07)');
INSERT INTO pet
(type_id, pet_name, owner_name, birth_date, owner_range)
VALUES
(3, 'Bojack', 'Arnett', '1964-06-12', '[2013-09-07,)');
ERROR: conflicting key value violates exclusion
constraint "ex_owner_overlap"
There are a couple prerequisites for using exclusion this way, of course. Since the gist index type wasn’t designed to handle types like INT or VARCHAR natively, we need to give it B-Tree capability first with an extension.
Beyond that oddity, we merely added the new ownership date range and then added the constraint itself. The exclusion syntax is to list the column and then the type of operator that should be applied. For ranges, that operator is && to indicate overlap. For our particular example, no pet with the same name can have overlapping ownership ranges. This is a tiny universe indeed!
Exclusion constraints work better for things like scheduling, preventing archive overlaps, and other operations that would be awkward or impossible otherwise. With other database engines, an application might have to search for a date range and other parameters and self-verify that a record is safe to insert. Under this paradigm, any tertiary data vector that isn’t so diligent would be free to ignore scheduling conflicts.
But no application, script, or API can ignore rules the database itself enforces. That’s what constraints are for: those times when being a shared resource is a liability. Judicious application of various constraints can protect as well as describe the data, and make things easier (and safer) for everyone.
And if that isn’t the Postgres motto, maybe it should be.