PG Phriday: All in a Name

Friday July 08, 2016

Page content

Naming objects in a database can sometimes be an exercise in frustration. What kind of guidelines should we follow for schemas and tables? What about columns or variables? Should the same rules apply to indexes, constraints, and sequences? Functions and triggers are much different than all of those elements, yet still exist within the same namespace. Then we have to contend with Postgres reserved words, many of which are probably only obvious to someone well versed in database lingo.

Luckily, a few short rules essentially address of most of these questions directly. For objects that defy description or are otherwise elusive, we can still converge on something workable. Let’s start with a short list of basics:

Don’t quote.
Underscores separate words.
Motive counts.
Prefixes prevent problems.
Specific as possible.

Not only are these rules simple to apply, they make a great mnemonic acronym: DUMPS! Let’s explore each of these in depth and consider why they help make the database a place everyone can share and enjoy.

Don’t Quote

This is a rarity, but not uncommon in systems facilitated by an ORM or some kind of design tool that directly translates user inputs into object names:

CREATE TABLE "User Forms"
(
  "ID"            SERIAL NOT NULL PRIMARY KEY,
  "Quick Key"     VARCHAR NOT NULL,
  "EntryType"     VARCHAR NOT NULL,
  "Table"         JSON NOT NULL,
  date_created    TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

Quoting allows us to override both case insensitivity and any reserved words Postgres might have—heck, we can even use spaces. While some developers and applications interpret this as consent, it greatly complicates actually using the objects. Quoted labels that contain mixed capitalization, special characters, or reserved words must be quoted forever. Whether it’s a DBA hand-writing a query, a reporting team trying to mine a series of tables for correlations, or an ORM.

The Postgres list of reserved words reveals that Postgres is actually more permissive than the standards dictate. The word “value” for example, might not be accepted in other database platforms, while Postgres will allow using it without quotes. This becomes a problem in mixed environments where various tools transmit data back and forth between database engines. The inconsistent behavior regarding reserved words means potential for unexpected results. There’s also the potential for Postgres to disallow common reserved words in the future.

Generally it’s best to simply avoid any database-esque words for object names and nullify the issue outright. Luckily following the specificity guideline pretty much solves syntax conflicts automatically, since our specific object name is very unlikely to be reserved by Postgres.

Underscores

Unlike most languages, SQL is case insensitive. Let’s take a look at what that means in practice:

CREATE OR REPLACE FUNCTION doSomethingCool(myVar INT)
RETURNS INT AS
$$
BEGIN
  RETURN myVar * myVar;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION DoSomethingCool(myVar INT)
RETURNS INT AS
$$
BEGIN
  RETURN myVar;
END;
$$ LANGUAGE plpgsql;

SELECT doSomethingCool(5);

 dosomethingcool 
-----------------
               5

While Postgres allows function overloading by specifying different arguments, that’s not what happened here. Instead, our carefully crafted CamelCase distinctions are discarded upon evaluation. As a consequence, the function names are identical, and the second definition overwrites the first one.

But it gets even more insidious than that. Check this out:

CREATE OR REPLACE FUNCTION check_me_out(nEst INT)
RETURNS INT AS
$$
DECLARE
  nest BOOLEAN; -- Should we nest the output?
BEGIN
  -- ... other code here.
  RETURN nEst;
END;
$$ LANGUAGE plpgsql;

SELECT check_me_out(5);

 dosomethingcool 
-----------------

Wait, what? This time, “nEst” is the same as “nest”, opening the potential for unexpected or undefined behavior. While this could arguably be called a Postgres bug since we shouldn’t be able to declare a variable with the same name as a functional parameter, it’s just one example.

Otherwise, the rule is simple: use underscores to split words. We don’t want onegiantunreadablestring, so it’s the only reliable way to retain user readability. Postgres doesn’t preserve mixed case at all in object names unless they’re quoted. As a consequence, it’s up to us to avoid those situations.

Here’s another illustration to drive the point home:

CREATE TABLE UserFormsAndOtherStuff (stuff INT);

\dt UserFormsAndOtherStuff

                 List of relations
 Schema |          Name          | Type  |  Owner   
--------+------------------------+-------+----------
 public | userformsandotherstuff | table | postgres

My eyes!

Motive Counts

This is essentially an extension of our specificity rule. This time, we’re incorporating the use case along with any necessary context and intent behind our objects. What does this mean? Take a look at these structures:

CREATE TABLE toll_sensor (...);
CREATE OR REPLACE VIEW v_all_active_sensors AS ...;

CREATE INDEX idx_sensor_ignore_inactive
    ON toll_sensor (sensor_id) WHERE is_online;

ALTER TABLE toll_sensor_log
  ADD CONSTRAINT chk_sensor_log_reject_invalid CHECK ...;

CREATE OR REPLACE FUNCTION f_log_traffic_encounter(...);
CREATE OR REPLACE FUNCTION f_nullify_logged_transponder(...);

CREATE TRIGGER t_sensor_log_advisory_audit_a_iud ...;

In a database context, it’s assumed tables will store multiple things. In fact, it’s probably easier to consider the name of the table as a description for a single row it contains. In our above toll_sensor table, each row within the table is a sensor. This suggests table names should be singular. Simultaneously, view names are an instantiation of a query, meaning they’re a result set and therefore inherently plural.

When we think about indexes, why is the index necessary? Is it just a basic index for a column, or is it necessary to facilitate a subset of searches? Views are essentially reports, so why does the view exist, and what is it returning? Constraints define how a table acts in many cases, and we should list those conditions or some summary of them. With functions, we want to know what the function does, and if possible, to what. Triggers can do a lot of things, and since they add quite a bit of overhead, we really want to explain their presence.

Just consider the context, and this should be easy.

Prefixes

Database schemas often become a literal dumping ground of database objects. Sometimes from multiple different competing applications. Imagine we just wrote an application to track automated highway tolling. We know in the future that there could be companion applications that use the same data, but might rely on tables of their own.

Here’s how that might look:

CREATE SCHEMA toll_system;
SET search_path TO toll_system;

CREATE TABLE toll_sensor (
  sensor_id  SERIAL NOT NULL,
  ...
);

ALTER TABLE toll_sensor
  ADD CONSTRAINT pk_toll_sensor PRIMARY KEY (sensor_id);

CREATE TABLE toll_transponder (
  transponder_id  SERIAL PRIMARY KEY NOT NULL,
  ...
);

CREATE TABLE toll_traffic_log (
  traffic_log_id  SERIAL PRIMARY KEY NOT NULL,
  ...
);

CREATE UNIQUE INDEX udx_traffic_log_encounter (...);

CREATE VIEW v_transponder_history AS
SELECT ...;

CREATE FUNCTION f_log_transponder(...)
RETURNS BOOLEAN AS ...;

CREATE TRIGGER t_traffic_log_audit_a_iud
AFTER INSERT OR UPDATE OR DELETE ON toll_traffic_log ...;

Things like tables, sequences, functions, views, indexes, and triggers all exist in the same namespace, and many even share the same context. We can avoid confusion as well as collisions by prefixing. The “rules” for this are straight-forward:

Label tables with the primary intent. For a tolling system, this would be “toll”, while an inventory system might prefer “inv”.
Use “v” for views.
Use “f” for functions. (Or “sp” for “stored procedure” if you prefer.)
Use “t” for triggers. (Optionally, suffix the trigger firing conditions: “b” for before, “a” for after, and any one of “i”, “u”, and “d” for insert, update, and delete.)
Indexes can be “idx” for a regular index, or “udx” for a unique index.
Indexes should also be prefixed with the parent table name, sans the table’s prefix.
Primary keys should use “pk”.
Foreign keys should use “fk”.
Boolean columns should be prefixed with “is”, “can”, “has”, etc.

Complicated? Maybe, but only slightly. It’s easy to summarize: abbreviate, shorten, and add. It’s hard to miss that we’re just using the object type as a letter prefix. This way, anyone looking at the schema can tell which things are views, tables, sequences, and so on. Further, examining the table will immediately show which indexes are unique, which constraints are a foreign key or a check, and so on. The point is consistency; this isn’t a bible.

These prefixes give us structure visibility, as well as some peace of mind in avoiding naming collisions in the giant blob of a namespace.

Specificity

Motive explains how and why, while being specific tells us what. Every once in a while, I come across a table like this:

CREATE TABLE storage
(
  id       SERIAL NOT NULL PRIMARY KEY,
  key      VARCHAR NOT NULL,
  type     VARCHAR NOT NULL,
  data     JSON NOT NULL,
  created  TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

Databases can contain hundreds or thousands of tables, each of which may encapsulate dozens or hundreds of rows. What is “storage”? What are we storing? Data? What kind of data? From where? This kind of table is a nightmare to decipher. There are no clues to track down which app might be using it, and no way to tell how the table is related to other tables in the schema.

A column name such as “data” may seem like an acceptable idea for a single application, but when there are a dozen applications which have their own “data”, it quickly starts to lose meaning. Sure, the application developer probably knows what all of these extremely generic terms refer to, but does anyone else? In this case, the intent was probably to use “key” as some kind of lookup value as often seen with key/value pairs in associative arrays or noSQL-based composite text keys. And likewise the “data” column is probably the associated JSON object for that lookup key.

But why should anyone have to guess? This is only one example, but there are many like it. Let’s assume this table is used by an application that stores user testimonies. The application captures a lot of varying data from polymorphic forms, so the JSON can’t really be transformed into a reliable set of columns. With those constraints, we might try this instead:

CREATE TABLE user_testimony
(
  testimony_id   SERIAL NOT NULL PRIMARY KEY,
  lookup_key     VARCHAR NOT NULL,
  comment_type   VARCHAR NOT NULL,
  user_feedback  JSON NOT NULL,
  date_created   TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

Isn’t that better? Now a quick glance reveals everything we need to know about the table and its contents. Tables are a back-end storage element, and are often shared by multiple applications and user groups. Shared resources should be as descriptive as possible to prevent confusion and potential misinterpretation or misuse.

This applies to any database object. Views should distinctly describe their purpose. Columns need to differentiate themselves from columns in other tables to prevent join ambiguity. Functions, sequences, everything should have a clearly distinguishable name.

Conclusion

This is a lot to consider. The DUMPS acronym helps, but an executive summary might be a simple: consistent and descriptive is best. A consistent naming scheme will prevent a lot of future headaches, and being descriptive forces us to think about the data we’re storing and why. Not only can this help us consider the architecture before setting it in stone, but it conveys meaning by simply existing.

Anyone who joins the company in the future can make sense of data that is at least partially self-documented. Queries can be more consistent and precise, with less ambiguity in every facet from column names to join conditions. Applications can be extended, or new APIs written to leverage old data. Why wade through ancient and probably undocumented code to determine original intent or business logic?

Just remember that access vectors are myriad and ephemeral, but data is forever. No matter how hard it is to think of good names, don’t let it get you down in the DUMPS.