PG Phriday: Database and Schema Basics

Sure, managing a Postgres cluster is great. We’ve got the software installed, more tool wrappers than we can wave several sticks at, and a running cluster that’s ready to do our bidding. But what do we do with it? In fact, what is a Postgres cluster anyway? How are databases involved? What about schemas? There’s critically important nomenclature that new users may find somewhat non-intuitive, so let’s demystify things a bit.

First, I’ll go on record that I never really liked the “cluster” label when it comes to how it’s often used in relation to Postgres. When people in the tech industry think of a cluster, they probably imagine something like this. Yes, a cluster can simply mean a grouping of related things, but context is everything. This gets even more confusing in larger contexts where a “cluster” of Postgres servers is a legitimate architecture, for horizontal scalability or otherwise.

So let’s back away for a second and consider the components themselves. Starting with the Postgres software, we have something similar to how programmers think of objects: instances. Classes are unusable until they’re instantiated as an object (barring virtual classes, interfaces, and other special cases.) In this way, the Postgres software is just there, doing nothing, until there’s an instance that’s created and running. Whether we use initdb, pg_createcluster, or some other wrapper to do this, no Postgres databases can exist anywhere without an instance to hold them.

The primary job of an instance is to merely exist; by itself, a Postgres instance is not usable. Postgres instances are assigned a port number for traffic control purposes, so it’s entirely possible to run multiple simultaneous instances on the same server. But otherwise, you can’t create tables in one, and they certainly are not a database.

This should not blow anyone’s mind, but there is an implicit lie here. Every new Postgres instance gets created along with with a database named “postgres”. Because of this, it’s very common for new users to consider an instance the database itself. After all, they can connect to “postgres” and create tables, issue queries, and so on. When using tools like pg_createcluster, an easy mistake to make, is that each application needs its own Postgres “cluster”. Even the Postgres initdb tool is a giant misnomer (or perhaps more of an anachronism); it creates a new instance which can contain databases.

So what’s a database then, and how do we create one and subsequently use it? When connecting to a Postgres database, most drivers make usage pretty clear. In order to connect, we need a port number and the database name. A default install leaves this at port 5432, with a database named “postgres”. Thus it’s not a stretch to view databases as an exclusive namespace. Any tables we create in one database can not be seen or used from another database.

As for creation, let’s assume we have an instance running on the defaults already. There are two ways to create new custom databases. First, we can use the createdb utility from the command line any number of times:

createdb myapp
createdb yourapp
createdb anotherapp

Or we can connect to the “postgres” database as a superuser (the “postgres” user by default), and then create the new databases using SQL syntax:

CREATE DATABASE myapp;
CREATE DATABASE yourapp;
CREATE DATABASE anotherapp;

This of course raises the question: why would I want to create separate databases? Some use cases might not demand this, but it’s generally considered good practice to place objects in a container that describes its contents. The word “postgres” means nothing to anyone but people who really like Postgres, and those who interact with it regularly. It doesn’t suggest how data might be related, what application or project the data might belong to, or any other myriad reasons. It’s a nice playground, but shouldn’t really be used otherwise.

Some companies name databases after intended environment. In such cases, we often see “dev” or “stage” or “production” or some other variant. Others use the application name itself, making it fairly clear why the database exists, and what its primary role entails. Regardless of the approach, we strongly recommend creating databases with meaningful names, no matter how that might be interpreted.

But what comes next? Say we followed the rules and created a database for our application. Does that mean we can simply start creating tables there immediately? Yes, of course we can! But please do everyone a favor and refrain from doing so. Because as with all things Postgres, we often get a default we should try to avoid for more official projects. This time, that default is the public schema.

Here’s what happens if we connect and create a table right away:

CREATE TABLE foo (
  id        SERIAL PRIMARY KEY,
  foo_desc  TEXT,
  lame_num  INT
);

\dt

              List of relations
 Schema |      Name       | Type  |  Owner   
--------+-----------------+-------+----------
 public | foo             | table | postgres

Any object we create, so long as we don’t provide a different schema, goes in the public context. To see why this might be a problem, consider how often global variables are used in large projects. This can be even more problematic for databases, as they can contain data from any number of related applications within a project. If any of these share a single table, view, or function with the same name, we could have a serious issue.

Schemas are the object namespaces of the database world. Without explicitly defining the desired schema, any new table or view simply goes into public, since it has to live somewhere. There are a few ways to use schemas to our advantage, and Postgres provides a few methods to simplify this even further.

Let’s start with the basics and create a schema and recreate our table there:

CREATE SCHEMA stats;

CREATE TABLE stats.foo (
  id        SERIAL PRIMARY KEY,
  foo_desc  TEXT,
  lame_num  INT
);

INSERT INTO stats.foo (foo_desc, lame_num)
       VALUES ('This is a thing.', 10);

SELECT * FROM stats.foo;

 id |     foo_desc     | lame_num 
----+------------------+----------
  1 | This is a thing. |       10

So long as a schema exists, we can use it simply by prefixing the table, view, or function with its name. But let’s say that’s annoying, because honestly who wants to do more typing? We can repeat the exercise by setting the Postgres search_path variable within our session.

CREATE SCHEMA more_stats;
SET search_path TO more_stats;

CREATE TABLE foo (
  id        SERIAL PRIMARY KEY,
  foo_desc  TEXT,
  lame_num  INT
);

INSERT INTO foo (foo_desc, lame_num)
       VALUES ('This is a another thing.', 100);

SELECT * FROM stats.foo;

 id |     foo_desc     | lame_num 
----+------------------+----------
  1 | This is a thing. |       10

SELECT * FROM foo;

 id |         foo_desc         | lame_num 
----+--------------------------+----------
  1 | This is a another thing. |      100

Notice how we were able to utilize content from both versions of the foo table? By changing the schema search path, we essentially modify what data we can access without requiring a prefix. What’s more, we can specify multiple schemas in a single search path. Let’s create another table and try it out:

CREATE SCHEMA less_stats;
SET search_path TO less_stats, more_stats, stats;

CREATE TABLE bar (
  id        SERIAL PRIMARY KEY,
  bar_desc  TEXT,
  cool_num  INT
);

INSERT INTO bar (bar_desc, cool_num)
       VALUES ('I like cows.', 42);

SELECT * FROM foo;

 id |         foo_desc         | lame_num 
----+--------------------------+----------
  1 | This is a another thing. |      100

SELECT * FROM bar;

 id |   bar_desc   | cool_num 
----+--------------+----------
  1 | I like cows. |       42

\dt

          List of relations
   Schema   | Name | Type  |  Owner   
------------+------+-------+----------
 less_stats | bar  | table | postgres
 more_stats | foo  | table | postgres

This example was pretty overloaded with information, so let’s decompress a bit. What happened?

  1. The bar table was created in the new less_stats schema, so schemas are used in the order listed.
  2. The foo and bar tables are in different schemas, but we could use both without a prefix. So multiple schemas in the search path mean we don’t have to use the prefix for any objects they contain.
  3. Since both more_stats and stats contain a foo table, the more_stats version gets precedence. So not only are schemas used in the order listed, objects get view priority as well. Objects that exist in multiple schemas can only be “seen” without a prefix if they’re the first match.

Given these rules, there is some potential for undefined behavior when relying on search_path for larger projects that may contain multiple schemas. If any of those schemas contain tables or views with the same name, it might come as a surprise which version gets used without strict adherence to some kind of defined hierarchy. As such, for automated use, we generally recommend being explicit and prefixing table and view names in queries with the expected schema.

But for ad-hoc use, search paths are extremely handy. We can even “automate” this to a degree. In the postgresql.conf file, we can change the search_path to whatever we wish. Thus, anyone who connects to any database in an instance will automatically get a search path that removes public or enforces a certain schema view tree. Of course, doing this effectively exposes any of the objects in those schemas to a single global context, so some caution is probably warranted.

Finally, we can specify a search path and explicitly associate a database user with it. Thus, any time this user connects, they will get that list of schemas by default. This carries far less risk than setting it for all databases, and lets us customize access vectors based on how the connection might be used. Here’s how that might look for the stats application:

ALTER USER stats_read SET search_path TO less_stats, more_stats;

Now when the stats_read user connects, they don’t need to use prefixes when interacting with tables in those schemas. If that’s the primary role of the user, we don’t really have to worry about namespace collisions, since it’s probably a small subset of the available schemas.

Taking this idea further, A single database can contain schemas for each version of an application. Or hosting providers can restrict users to their own schema, so hundreds or even thousands of users can share a single database. Or perhaps each schema can correspond to a logical shard name, so the tables within reflect the same basic architecture, making shard-aware applications possible. In fact, this is how I designed shard_manager to work.

In the end, it boils down to this:

Instance -> Database -> Schema -> Object (table, view, etc.)

Each one of these can contain multiples of the thing it’s designed to hold. So unless there’s some kind of explicit need to physically separate data in such a way it needs another network port to access, multiple databases can coexist peacefully. Please keep this stack in mind before reaching for initdb, pg_createcluster, or similar tools.