PG Phriday: COPY and Alternative Import Methods

I recently noted that the COPY command in Postgres doesn’t have syntax to skip columns in source data when importing it into a table. This necessitates using one or more junk columns to capture data we’ll just be throwing away. In the process, I completely forgot that friendly devs had contributed alternative file handling methods as Foreign Data Wrappers. Most people think of foreign wrappers as a method for interacting with remote databases. Perusing the full list, however, reveals some surprising data sources. Twitter? Hive? Video cards?!

Well, let’s take a quick look at a real case. Here’s a table we want to import data into. We can’t COPY into it directly, because the source data contains an extra column the table doesn’t have. So we’ll use an intermediate table as a raw import target, then insert into the real target. Let’s use COPY as a first approximation with 500,000 rows:

CREATE UNLOGGED TABLE option_quote_raw
(
  ext_quote_id BIGINT,
  quote_date DATE NOT NULL,
  quote_time_ns BIGINT NOT NULL,
  seq_no BIGINT NOT NULL,
  sub_seq_no BIGINT NOT NULL,
  exchange_id INT NOT NULL,
  time_us BIGINT NOT NULL,
  feed_side VARCHAR NOT NULL,
  flag INT NOT NULL,
  bid NUMERIC(16, 8) NOT NULL,
  ask NUMERIC(16, 8) NOT NULL,
  bid_size INT NOT NULL,
  ask_size INT NOT NULL,
  trash TEXT
);

CREATE UNLOGGED TABLE option_quote
(
  ext_quote_id BIGINT,
  quote_date DATE NOT NULL,
  quote_time_ns BIGINT NOT NULL,
  seq_no BIGINT NOT NULL,
  sub_seq_no BIGINT NOT NULL,
  exchange_id INT NOT NULL,
  time_us BIGINT NOT NULL,
  feed_side VARCHAR NOT NULL,
  flag INT NOT NULL,
  bid NUMERIC(16, 8) NOT NULL,
  ask NUMERIC(16, 8) NOT NULL,
  bid_size INT NOT NULL,
  ask_size INT NOT NULL
);

CREATE INDEX idx_quote_id ON option_quote (ext_quote_id);

\timing on

COPY option_quote_raw FROM '/tmp/quote_source.csv' WITH CSV;

Time: 1705.967 ms

INSERT INTO option_quote
SELECT ext_quote_id, quote_date, quote_time_ns, seq_no,
       sub_seq_no, exchange_id, time_us, feed_side, flag,
       bid, ask, bid_size, ask_size
  FROM option_quote_raw;

Time: 2062.863 ms
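
As an aside, this is exactly why the junk column exists: COPY insists on consuming every field in the file, so pointing it directly at the target table fails with something like:

COPY option_quote FROM '/tmp/quote_source.csv' WITH CSV;

ERROR:  extra data after last expected column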

Just under four seconds for half a million records, into a table with a pre-existing index, is pretty good. But the process itself is convoluted. Going back to that list of foreign wrappers, Postgres itself ships with file_fdw as a contrib module for tying a table directly to a file. Let’s repeat the same process from above using FDW syntax:

CREATE EXTENSION file_fdw;
CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE file_quote
(
  ext_quote_id BIGINT,
  quote_date DATE NOT NULL,
  quote_time_ns BIGINT NOT NULL,
  seq_no BIGINT NOT NULL,
  sub_seq_no BIGINT NOT NULL,
  exchange_id INT NOT NULL,
  time_us BIGINT NOT NULL,
  feed_side VARCHAR NOT NULL,
  flag INT NOT NULL,
  bid NUMERIC(16, 8) NOT NULL,
  ask NUMERIC(16, 8) NOT NULL,
  bid_size INT NOT NULL,
  ask_size INT NOT NULL,
  trash TEXT
)
SERVER file_server
OPTIONS (format 'csv', filename '/tmp/quote_source.csv');

\timing on

INSERT INTO option_quote
SELECT ext_quote_id, quote_date, quote_time_ns, seq_no,
       sub_seq_no, exchange_id, time_us, feed_side, flag,
       bid, ask, bid_size, ask_size
  FROM file_quote;

Time: 3867.848 ms

Looks like the overall time is roughly the same. Of course, the best use case for this approach is a standard file path that doesn’t change. The contents of the “table” aren’t read until it’s queried, so the filename option could point at a standard file upload location. This means the foreign table can be reused indefinitely, while the raw import table in our first attempt has to be truncated and refilled before every INSERT.
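
To make the contrast concrete, here’s a sketch assuming a daily process that drops each new file at the same path:

-- Raw-table workflow: every new batch means a wipe and a reload
-- before the INSERT ... SELECT can run again.
TRUNCATE option_quote_raw;
COPY option_quote_raw FROM '/tmp/quote_source.csv' WITH CSV;

-- Foreign-table workflow: no reload step at all. Overwrite the file,
-- then simply re-run the INSERT ... SELECT against file_quote.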

There’s also the case of archives, where CSV files are compressed for long-term storage. What if we want to examine the contents of these files, or import them at a later date? Luckily, the Multicorn extension exists as a kind of generic Python FDW API. With it, anyone can build a quick foreign data wrapper in Python. Someone did just that and contributed an extension that can process gzip-compressed files.

Let’s see how that works:

CREATE EXTENSION multicorn;

CREATE SERVER csvgz FOREIGN DATA WRAPPER multicorn OPTIONS (
    wrapper 'csvgz_fdw.CSVGZForeignDataWrapper'
);

CREATE FOREIGN TABLE file_quote_gz
(
  ext_quote_id BIGINT,
  quote_date DATE NOT NULL,
  quote_time_ns BIGINT NOT NULL,
  seq_no BIGINT NOT NULL,
  sub_seq_no BIGINT NOT NULL,
  exchange_id INT NOT NULL,
  time_us BIGINT NOT NULL,
  feed_side VARCHAR NOT NULL,
  flag INT NOT NULL,
  bid NUMERIC(16, 8) NOT NULL,
  ask NUMERIC(16, 8) NOT NULL,
  bid_size INT NOT NULL,
  ask_size INT NOT NULL,
  trash TEXT
)
SERVER csvgz
OPTIONS (format 'csv', file_name '/tmp/quote_source.csv.gz');

\timing on

INSERT INTO option_quote
SELECT ext_quote_id, quote_date, quote_time_ns, seq_no,
       sub_seq_no, exchange_id, time_us, feed_side, flag,
       bid, ask, bid_size, ask_size
  FROM file_quote_gz;

Time: 16563.306 ms

This takes just shy of 17 seconds for the same 500k records, probably because the stack includes Multicorn, which depends on Python. Despite the multiple levels of redirection, at least it’s possible! There is a native C version, but the author neglected to include build instructions. Performance could likely be improved with better streaming, too, since zcat decompresses the same file in under a second. It also suggests an exciting possibility: an extension that takes the compression engine as another option, so tables could read bzip2, xz, or other compressed formats as well.
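
Declaring a table through such a wrapper might look like the following. To be clear, the wrapper class and the compression option here are pure invention, sketching what the interface could look like:

-- Entirely hypothetical: a wrapper that takes the compression codec
-- as an option instead of hard-coding gzip.
CREATE SERVER csv_archive FOREIGN DATA WRAPPER multicorn OPTIONS (
    wrapper 'csv_archive_fdw.CSVArchiveWrapper'
);

CREATE FOREIGN TABLE file_quote_xz
(
  ext_quote_id BIGINT,
  quote_date DATE NOT NULL
  -- ... remaining columns as in file_quote_gz
)
SERVER csv_archive
OPTIONS (format 'csv', compression 'xz',
         file_name '/tmp/quote_source.csv.xz');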

Compressed files aside, is there any way to handle wildcards? Sometimes it makes sense to process every CSV file in a single directory, for example. Well, a wrapper for that doesn’t exist yet, but at the rate new wrappers are appearing, it’s only a matter of time. It’s also possible there’s already one out there that simply isn’t on the Wiki page, or whose author hasn’t formalized it for contribution yet. In fact, there are probably quite a few foreign data wrappers in that situation.
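
In the meantime, one stopgap on Postgres 9.3 and above is COPY ... FROM PROGRAM, which pipes the output of a shell command into the table, so the shell can expand the wildcard for us:

-- Runs on the server as the postgres user and, like file-based COPY,
-- requires superuser privileges. The directory here is illustrative.
COPY option_quote_raw FROM PROGRAM 'cat /tmp/quotes/*.csv' WITH CSV;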

It’s great to see other import methods appearing to address deficits in COPY functionality. I’ll definitely be watching the Wiki page for more.