PG Phriday: Postgres as Middleware

One of the cool things I like most about Postgres, is that it’s probably the most inclusive database software I’ve ever encountered. It’s so full of features and functionality these days, it’s practically middleware. Almost anything plugs into it, and if it doesn’t, there’s usually a way to make it happen.

Want a demonstration?

SciDB is often used for large analytical data warehouses. They even use Postgres for metadata storage. Despite this, they still haven’t written a foreign data wrapper for back-and-forth interaction. But they have written a Python API. So what can we do with it?

This isn’t a SciDB tutorial, so let’s just create a very simple data array with one attribute over one dimension. Nothing crazy here:

CREATE ARRAY thing  [id=0:10,10,0];
store(build(thing, 'Thing ' + string(id)), thing);
scan(thing);

{id} msg
{0} 'Thing 0'
{1} 'Thing 1'
{2} 'Thing 2'
{3} 'Thing 3'
{4} 'Thing 4'
{5} 'Thing 5'
{6} 'Thing 6'
{7} 'Thing 7'
{8} 'Thing 8'
{9} 'Thing 9'
{10} 'Thing 10'

The hardest part about accessing this data via Postgres, was actually getting the SciDB python driver to work. Instead of directly utilizing their own client libraries, it actually obtains data through an HTTP proxy they called shim. Once that was done, I just had to find the right calls in the SciDB-Py documentation and refresh my memory on PL/Python.

Given all of those parts, our whole stack resembles something like this:

  1. SciDB
  2. Shim
  3. SciDB-Py
  4. PL/Pythonu
  5. Postgres

If there were an actual foreign data wrapper, we could use it to replace the middle three layers. Until then, that’s the overhead involved. And here is all of the magic for viewing that external data:

CREATE TYPE scidb_thing AS (
  id   int,
  msg  text
);

CREATE OR REPLACE FUNCTION get_scidb_things()
RETURNS SETOF scidb_thing
AS $$
  import numpy
  import scidbpy as scidb
  sdb = scidb.connect('http://localhost:8080')

  for pair in sdb.afl.scan('thing').tosparse():
    yield pair

$$ LANGUAGE plpythonu;

CREATE OR REPLACE VIEW v_scidb_thing AS
SELECT *
  FROM get_scidb_things();

This usage comes in two parts, with a third for the sake of convenience. We create a type because we want to interact with this data as if it were a table. Then we create a function that uses python to access the external data, and cast it into that type. The end result is that we can get the external data rather easily. With a bit more coding, we could make the function generic and use it to access any external SciDB array.

The view merely obfuscates the function call to emulate local usage. Let’s use the view and see what we get:

SELECT *
  FROM v_scidb_thing
 WHERE id = 9;

 id |   msg   
----+---------
  9 | Thing 9

Given a lot of time and inspiration, we could write a local Postgres API with the capability to create, fill, or otherwise treat any SciDB array like any other external data object. I would never advocate such an approach in this case simply because of the awful overhead. The above query required 300ms to execute in my VM, which is about 100x slower than directly querying SciDB. But this is also all of the old stuff! Imagine what we could do with more recent techniques like a FDW extension.

The thing about Postgres is that it’s a database, and an interface. I just used Python to give it access to data in another system without a native client library or extension. The fact that the whole process only took about an hour makes it even more ludicrous. For small to medium-sized objects, we could combine this technique with materialized views for speedy local access.

Really, the only thing preventing Postgres from doing something is your imagination. That’s pretty liberating.