PG Phriday: On the Move


Sometimes when we have an existing database and schema full of tables, there comes a time when we need to perform a migration. Maybe it’s because we want to move to or from a cloud service. Perhaps we have a small database and want to do a quick version upgrade via dump / restore. Whatever the reason, we may decide to clean up some technical debt while we’re making the transition.

Many Postgres experts recommend against creating objects in the public schema. This is the default schema that exists in nearly all Postgres databases, and there are often implicit grants that could make our objects available in unexpected scenarios. It’s also a cluttered namespace if all tables, views, functions, etc., are created there by default. Using it is sloppy and makes future data or operation segregation much more difficult.
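
Those implicit grants include, for example, the CREATE privilege that the public schema historically hands to every role. As a small illustration (not something the sample database does for us), a common hardening step is to revoke it:

-- Remove the implicit ability of every role to create objects in public
REVOKE CREATE ON SCHEMA public FROM PUBLIC;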

So how can we move a bunch of existing stuff out of the public schema safely?

What’s a DVD?

In order to illustrate this, let’s use some fake data in a database that we want to relocate. The folks over at Postgresql Tutorial supply a sample database designed to model a DVD rental service. Uh… this may require a bit more background given our modern context.

Back before everything was a digital stream, we used to put media on shiny plastic discs called a “DVD”. Since these were physical objects, it was necessary to obtain them from a rental store. This was such a common service that someone took the time to create a database to illustrate the concept for learning purposes.

In this case, whoever designed this schema put everything in the public schema like some kind of maniac. Let’s import it for a better look:

wget https://www.postgresqltutorial.com/wp-content/uploads/2019/05/dvdrental.zip
unzip dvdrental.zip

createdb dvdrental
pg_restore -d dvdrental dvdrental.tar

The file is provided as a pg_dump tar file which has been compressed into a zip archive. As such, we can use the pg_restore utility to load that data into any database we choose. In this case, we’ve restored the DVD rental project into a database named dvdrental. At this point it should be ready to use, but… is that something we want to do?

Public Philistines

Let’s take a closer look at the bundle as it appears in the database itself.

dvdrental=# \dt

             List of relations
 Schema |     Name      | Type  |  Owner   
--------+---------------+-------+----------
 public | actor         | table | postgres
 public | address       | table | postgres
 public | category      | table | postgres
 public | city          | table | postgres
 public | country       | table | postgres
 public | customer      | table | postgres
 public | film          | table | postgres
 public | film_actor    | table | postgres
 public | film_category | table | postgres
 public | inventory     | table | postgres
 public | language      | table | postgres
 public | payment       | table | postgres
 public | rental        | table | postgres
 public | staff         | table | postgres
 public | store         | table | postgres

Eww. That’s nasty. And it’s not just all of those tables: all of the views, functions, and everything else necessary to run the DVD shop lives in the public schema too. And the table names are so generic! They’re just an assortment of common nouns, so any other app we may want to install could easily conflict with these tables!

That’s a travesty. Let’s just start over.

dropdb dvdrental

There, much better!

Surgical Intervention

But what do we do now? Believe it or not, the pg_restore utility has several features that allow us to extract and interact with the contents in ways that will let us perform a bit of impromptu surgery. Unfortunately for us, the process we need is slightly more convoluted than it really ought to be.
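
For instance, the archive carries a full table of contents that we can inspect without restoring anything, or save to a list file for later use with the -L option:

pg_restore -l dvdrental.tar | head

# Save the listing for editing and selective restore
pg_restore -l dvdrental.tar > dvdrental.list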

It used to be possible to play cute games with the restore procedure. It was once possible to produce a list file of objects to restore, modify the list file by altering the schema target, and restore using that modified map. Due partially to an exploit discovered in 2018, dumps began to hard-code schema names into all dumped objects directly. This means all commands encoded into a dump include a schema prefix like this:

CREATE TYPE public.mpaa_rating AS ENUM (
    'G',
    'PG',
    'PG-13',
    'R',
    'NC-17'
);

This means we can’t even restrict restored objects to reside in schemas assigned to a user’s search path. How rude! Despite that, we can play a couple of tricks using sed to make it possible once again. Let’s begin by extracting the pre- and post-data DDL commands from the dump.

pg_restore --section=pre-data -f pre-data.sql dvdrental.tar
pg_restore --section=post-data -f post-data.sql dvdrental.tar

These two sections exist because it’s much more efficient to load data into a table without indexes, constraints, triggers, and other elements. It’s best to create those structures once data is loaded, and we also avoid accidentally modifying our data with active triggers.
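
If we want to confirm where that dividing line falls, a quick grep over the two files shows tables on one side and indexes, constraints, and triggers on the other:

grep -c 'CREATE TABLE' pre-data.sql
grep -cE 'CREATE INDEX|ADD CONSTRAINT|CREATE TRIGGER' post-data.sql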

Now that we have separate DDL files, let’s modify them slightly using sed to replace public with dvd_rental as our target schema:

sed -i 's/public\./dvd_rental./g;' *-data.sql

This is a very simplistic approach, and should be easy to follow. It’s just a single search and replace on the SQL files we created. The next step is to actually create the schema and import the objects as a quick way to test the process:

createdb dvdrental

psql -c 'CREATE SCHEMA dvd_rental' dvdrental
psql -f pre-data.sql dvdrental

We should have no errors at this point. Now all we have is a collection of empty tables, so the next step is to import the data itself:

pg_restore -a dvdrental.tar -f - | 
    sed "
      s/^COPY public\./COPY dvd_rental./g;
      s/setval('public\./setval('dvd_rental./g;
    " | 
    psql dvdrental

By specifying pg_restore -a, we’re telling the restore process to only decode the data loading portions of the dump. This includes COPY commands for table contents, and setval() statements to bootstrap any sequences to the correct values. As expected, both of these statement types are prefixed with public such that object namespaces are explicit. All our sed command does is rewrite public to dvd_rental similarly to how we modified the pre and post scripts.

That output is simple SQL, so the last step is to redirect those commands to psql. We could also have created a pure SQL file for closer examination and then performed the import as a distinct step; that variant is sketched just after the post-data step below. In any case, our data is in the database, so our final command is to apply the post-script:

psql -f post-data.sql dvdrental

Again, if we did everything right, there should be no errors.
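
For reference, the file-based variant of the data step might look like this (the file name is arbitrary):

# Write the data section to a plain SQL file, rewrite the schema, then load it separately
pg_restore -a dvdrental.tar -f data-only.sql
sed -i "s/^COPY public\./COPY dvd_rental./g; s/setval('public\./setval('dvd_rental./g" data-only.sql
psql -f data-only.sql dvdrental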

Finally Settled

Now we can connect to our dvdrental database and examine the default public schema:

dvdrental=# \dt

Did not find any relations.

Success! But did everything go where we put it?

dvdrental=# SET search_path TO dvd_rental;
dvdrental=# \dt

               List of relations
   Schema   |     Name      | Type  |  Owner   
------------+---------------+-------+----------
 dvd_rental | actor         | table | postgres
 dvd_rental | address       | table | postgres
 dvd_rental | category      | table | postgres
 dvd_rental | city          | table | postgres
 dvd_rental | country       | table | postgres
 dvd_rental | customer      | table | postgres
 dvd_rental | film          | table | postgres
 dvd_rental | film_actor    | table | postgres
 dvd_rental | film_category | table | postgres
 dvd_rental | inventory     | table | postgres
 dvd_rental | language      | table | postgres
 dvd_rental | payment       | table | postgres
 dvd_rental | rental        | table | postgres
 dvd_rental | staff         | table | postgres
 dvd_rental | store         | table | postgres

dvdrental=# SELECT COUNT(*) FROM film_actor;

 count 
-------
  5462

Success!

A Suitable Alternative

Sadly, this procedure is far from ideal. Because we’re manipulating the raw SQL behind the dump/restore system, we can no longer take advantage of parallel restore jobs. We’re also injecting a sed process directly into the restore pipeline. Even if that’s a negligible amount of overhead, every line of our data incurs unnecessary substitution matching.

Perhaps a better set of steps would be something like this:

  1. Create a clean and empty Postgres instance elsewhere. A good location for this is a Docker container.
  2. Restore the data to that instance using the -j parallel parameter.
  3. Use ALTER statements to move objects to a new schema.
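
A rough sketch of those three steps, assuming a scratch container named pg_scratch on port 5433 (the container name, password, and port are all placeholders):

# 1. Throwaway Postgres instance in Docker
docker run -d --name pg_scratch -e POSTGRES_PASSWORD=scratch -p 5433:5432 postgres
export PGPASSWORD=scratch

# 2. Restore the dump into the scratch instance. Add -j N here if the dump is in
#    custom (-Fc) or directory (-Fd) format; the sample tar archive restores serially.
createdb -h localhost -p 5433 -U postgres dvdrental
pg_restore -h localhost -p 5433 -U postgres -d dvdrental dvdrental.tar

# 3. Move everything out of public with ALTER ... SET SCHEMA
psql -h localhost -p 5433 -U postgres dvdrental <<'SQL'
CREATE SCHEMA dvd_rental;
DO $$
DECLARE obj record;
BEGIN
  -- Tables: indexes and owned sequences follow their tables automatically
  FOR obj IN SELECT tablename FROM pg_tables WHERE schemaname = 'public' LOOP
    EXECUTE format('ALTER TABLE public.%I SET SCHEMA dvd_rental', obj.tablename);
  END LOOP;
  -- Views: functions and standalone sequences would need similar loops
  FOR obj IN SELECT viewname FROM pg_views WHERE schemaname = 'public' LOOP
    EXECUTE format('ALTER VIEW public.%I SET SCHEMA dvd_rental', obj.viewname);
  END LOOP;
END;
$$;
SQL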

At this point, we have two options. The first is to stick with the pg_dump and pg_restore utilities:

  1. Produce a new dump using the (-Fd) directory and (-j) parallel parameters.
  2. Restore this new schema-shifted dump to the final database using the -j parallel parameter.
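
That might look roughly like this, continuing from the scratch instance above (the output directory and job counts are arbitrary):

# Dump the schema-shifted copy in directory format, in parallel
pg_dump -h localhost -p 5433 -U postgres -Fd -j 4 -f dvdrental_dir dvdrental

# Restore it into the final destination, also in parallel
pg_restore -j 4 -d dvdrental dvdrental_dir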

Or we could use logical replication:

  1. Create a publication for the new schema in the donor database.
  2. Subscribe to the publication in the target database, and wait.
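
In SQL terms, that boils down to something like this. The publication, subscription, and connection details are placeholders, the donor needs wal_level = logical, and the target must already contain matching table definitions, since logical replication only copies data:

-- On the donor (scratch) instance
CREATE PUBLICATION dvd_pub FOR TABLES IN SCHEMA dvd_rental;  -- Postgres 15+; older versions can use FOR ALL TABLES

-- On the target database
CREATE SUBSCRIPTION dvd_sub
    CONNECTION 'host=localhost port=5433 dbname=dvdrental user=postgres password=scratch'
    PUBLICATION dvd_pub;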

Depending on how much data we need to load, this may actually be faster since either approach can load data in parallel. Our standard dump / modify / restore procedure doesn’t have that luxury.

In the end, it would be better if Postgres provided this kind of functionality natively. If we are combining two databases that both have existing objects in the public schema, neither is directly compatible with the other thanks to potential name conflicts. If we are given a dump, there’s no legitimate way to modify it for safe import into an existing database because it could pollute the public namespace. There’s no real encapsulation or mapping when importing external data structures.

This may have been fine five or ten years ago, but increasing popularity brings more users and pre-existing data models. Thanks to scenarios like this, the built-in tools are becoming less relevant unless you want to jump through extra hoops. Logical replication helps, but Postgres currently restricts it to matching schema names, so we can’t take advantage of it to shift schemas between Postgres installations.

Still, in the grand scheme of things, this is a minor stumbling block. We technically have three ways to circumvent it with some rudimentary techniques. A determined user will always find a way after all!