PG Phriday: Stuck in the Middle with Postgres

Earlier this year, I implied Postgres was some kind of super middleware for dragging data out of every external resource it could locate. But that example only used the Postgres foreign data wrapper to contact another Postgres server. Why be so unimaginative? The future is as unlimited as it is terrifying.

Meet the new Postgres mascot

Let’s start with a few prerequisites. We want to see the combined functionality of a few different Postgres capabilities, so let’s try to combine data from MySQL, Postgres, and JavaScript via V8. To do that, we need a couple of easily obtained resources. Debian-based systems can do something like this:

sudo apt-get install postgresql-9.6-plv8 postgresql-9.6-mysql-fdw

Otherwise, packages are available from the Postgres site, or under some other name in your favorite distribution’s repositories. All we need for this is some kind of MySQL install, plus the MySQL and V8 Postgres extensions. More ambitious readers can try applying these techniques to SQL Server as well.

Now imagine we have a marketing infrastructure with data strewn all over the world, in a vast assortment of incompatible or legacy apps, each with its own portion of the data we want to see. Not an ideal situation really, and one normally solved by ETL to pull all of the data into a single location for perusal. But transformation takes time, and we’re exceptionally lazy. Instead of all that extract, transform, and load action, let’s just query the data directly.

We can get started by creating a basic market MySQL table with two measly rows for demonstration purposes:

CREATE TABLE market_data (
  ext_root   VARCHAR(10) PRIMARY KEY,
  trader     VARCHAR(20) NOT NULL,
  stock      VARCHAR(10) NOT NULL,
  action     CHAR(1) NOT NULL,
  price      NUMERIC(10,4) NOT NULL,
  xact_date  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO market_data (ext_root, trader, stock, action, price)
       VALUES ('IZ724JJ5:3', 'bsmith', 'GOOG', 'b', 800);
INSERT INTO market_data (ext_root, trader, stock, action, price)
       VALUES ('IZ724JJ5:4', 'bsmith', 'GOOG', 's', 842);

Easy, right? Thanks to the default, each row automatically gets a timestamp on insert, as if it were part of a rapidly accumulating feed. This data isn’t normalized, since that would greatly complicate the data model, and for the purposes of this exercise it isn’t really necessary. If it helps, imagine this is just a view that sits on top of all the relevant tables and flattens the data into something an external resource can easily consume.
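
Such a flattening view might look something like this sketch. The underlying tables (trades, traders, instruments) are entirely hypothetical; they just stand in for whatever normalized structure really exists:

-- Hypothetical flattening view; trades, traders, and instruments are
-- invented names standing in for the real normalized tables.
CREATE VIEW market_data AS
SELECT t.ext_root,
       u.username  AS trader,
       i.symbol    AS stock,
       t.action,
       t.price,
       t.xact_date
  FROM trades t
  JOIN traders u ON u.trader_id = t.trader_id
  JOIN instruments i ON i.instrument_id = t.instrument_id;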

The next thing we need is some data in a Postgres database, because why not? Consider this on some server in a far-away land filled with unicorns and butterflies cavorting with innocent children in a field of daisies. It’s definitely not in our datacenter!

CREATE TABLE trade_calc (
  ext_root  VARCHAR(10) PRIMARY KEY,
  stdev     NUMERIC NOT NULL,
  mean      NUMERIC NOT NULL,
  delta     NUMERIC NOT NULL
);

INSERT INTO trade_calc VALUES ('IZ724JJ5:3', 13.7, 825, 11.3);
INSERT INTO trade_calc VALUES ('IZ724JJ5:4', 8.5, 832, 1.5);

In this case, the table contains supplementary information regarding each trade. Maybe an application is generating various calculations derived from the trade and storing them in a local resource for later consumption. Regardless, we’ve retained the external root identifier used to tie the rows together, which is a common feature of market data.
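
A population job for a table like this might look roughly like the sketch below. The price_history table and the specific formulas are pure invention for illustration; the real calculations would be whatever the application actually derives:

-- Hypothetical calculation job; price_history and these formulas are
-- placeholders, not part of the original example.
INSERT INTO trade_calc (ext_root, stdev, mean, delta)
SELECT ext_root,
       stddev_samp(price),
       avg(price),
       max(price) - min(price)
  FROM price_history
 GROUP BY ext_root
ON CONFLICT (ext_root) DO UPDATE
    SET stdev = EXCLUDED.stdev,
        mean  = EXCLUDED.mean,
        delta = EXCLUDED.delta;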

The next step is to introduce some magical binding substance of some kind.

Why not?

We’ll be installing all of these wonderful resources in a Postgres database dedicated specifically for that purpose. One with user accounts and grants in a centralized location where everyone and his pet fish can access and aggregate data stored there. The MySQL portion of our tentacle would look and function something like this:

CREATE EXTENSION mysql_fdw;

CREATE SERVER mysql_market
  FOREIGN DATA WRAPPER mysql_fdw
  OPTIONS (host 'localhost');

CREATE USER MAPPING FOR postgres
  SERVER mysql_market
  OPTIONS (username 'root');

CREATE SCHEMA mysql;

IMPORT FOREIGN SCHEMA market LIMIT TO (market_data)
  FROM SERVER mysql_market
  INTO mysql;

SELECT * FROM mysql.market_data;

  ext_root  | trader | stock | action |  price   |      xact_date      
------------+--------+-------+--------+----------+---------------------
 IZ724JJ5:3 | bsmith | GOOG  | b      | 800.0000 | 2017-03-10 15:09:32
 IZ724JJ5:4 | bsmith | GOOG  | s      | 842.0000 | 2017-03-10 15:09:45

One foreign MySQL resource allocated and tested. Check!

Without a readily available node.js or REST interface to contact, we can simulate polling some kind of JSON resource with the PLV8 language. This is a good way to demonstrate tying elements to Postgres using something other than SQL. It also gives us a way to grab data in situations where no foreign data wrapper exists and we’re not confident enough to write one.

In this case, we’re calling a V8 function which gets the data through some dubious means and sends it back to us:

CREATE EXTENSION plv8;

CREATE SCHEMA js;
SET search_path TO js;

CREATE TYPE clearing_data AS (
  ext_root   VARCHAR(10),
  market     VARCHAR(10),
  bank       VARCHAR(10),
  is_funded  BOOLEAN
);

CREATE OR REPLACE FUNCTION get_clearing_info()
RETURNS SETOF clearing_data AS
$$
  plv8.return_next({
    "ext_root": "IZ724JJ5:3", "market": "NASDAQ",
    "bank": "Chase", "is_funded": false
  });
  plv8.return_next({
    "ext_root": "IZ724JJ5:4", "market":
    "NASDAQ", "bank": "Citi", "is_funded": true
  });
$$ LANGUAGE plv8;

CREATE VIEW all_clearing_data AS
SELECT * FROM get_clearing_info();

SELECT * FROM js.all_clearing_data;

  ext_root  | market | bank  | is_funded 
------------+--------+-------+-----------
 IZ724JJ5:3 | NASDAQ | Chase | f
 IZ724JJ5:4 | NASDAQ | Citi  | t

Success! Hard-coded data isn’t exactly robust, but demos are open season, and hey, look over there! In a real scenario, we’d probably have a function that accepted parameters and pulled from some external resource. The view is just a way of hiding the function call from users whose heads might explode upon learning they can select rows from a function.
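
As a rough sketch of that less hard-coded approach, the function could accept a JSON payload, perhaps fetched by some external poller, and emit one row per element. The payload shape here is an assumption that happens to match the clearing_data type:

CREATE OR REPLACE FUNCTION js.get_clearing_info(payload TEXT)
RETURNS SETOF js.clearing_data AS
$$
  // Parse the supplied JSON array and return each element as a row.
  JSON.parse(payload).forEach(function (row) {
    plv8.return_next(row);
  });
$$ LANGUAGE plv8;

SELECT * FROM js.get_clearing_info(
  '[{"ext_root": "IZ724JJ5:3", "market": "NASDAQ",
     "bank": "Chase", "is_funded": false}]'
);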

Lastly we need to contact that Postgres server out in the magical field of unicorns, butterflies, and candy canes. What, I didn’t mention candy last time? Well there’s candy now. Deal with it.

CREATE EXTENSION postgres_fdw;

CREATE SERVER pgsql_market
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (dbname 'market', host 'localhost');

CREATE USER MAPPING FOR postgres
  SERVER pgsql_market
  OPTIONS (user 'postgres');

CREATE SCHEMA pgsql;

IMPORT FOREIGN SCHEMA public LIMIT TO (trade_calc)
  FROM SERVER pgsql_market
  INTO pgsql;

SELECT * FROM pgsql.trade_calc;

  ext_root  | stdev | mean | delta 
------------+-------+------+-------
 IZ724JJ5:3 |  13.7 |  825 |  11.3
 IZ724JJ5:4 |   8.5 |  832 |   1.5

That was essentially the same process we used with MySQL, just with the Postgres FDW this time. Either way, it works and returns data exactly as expected.
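
If the remote Postgres server ever grows large enough that plans against the foreign table look questionable, postgres_fdw also offers a few server-level knobs. This is optional tuning, not something the demo needs:

-- Ask the remote server for row estimates during planning, and pull
-- rows across in larger batches.
ALTER SERVER pgsql_market
  OPTIONS (ADD use_remote_estimate 'true', ADD fetch_size '1000');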

The final step is to slap on the duct tape and invoke everything at once:

SELECT m.*, t.*
  FROM mysql.market_data m
  JOIN pgsql.trade_calc t USING (ext_root)
  JOIN js.all_clearing_data c USING (ext_root);

  ext_root  | trader | stock | action |  price   |      xact_date      |  ext_root  | stdev | mean | delta 
------------+--------+-------+--------+----------+---------------------+------------+-------+------+-------
 IZ724JJ5:3 | bsmith | GOOG  | b      | 800.0000 | 2017-03-10 15:09:32 | IZ724JJ5:3 |  13.7 |  825 |  11.3
 IZ724JJ5:4 | bsmith | GOOG  | s      | 842.0000 | 2017-03-10 15:09:45 | IZ724JJ5:4 |   8.5 |  832 |   1.5

The real power here is in combining this with the external resource gathering techniques we explored in the previously mentioned article. Is the external REST call way too slow? Slap on a materialized view and index the crap out of it. Have a super stable API for an internally developed resource? Write a foreign data wrapper so Postgres can invoke it directly. Then add a materialized view for good measure. Add a refresh schedule for any resources that adversely affect performance for direct inquiries.
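
As a sketch of that caching idea, the slow JavaScript-backed view could be snapshotted into a materialized view, indexed, and refreshed on whatever schedule suits the data. The object names here are made up for illustration:

-- Cache the slow JS-backed view and index it for direct lookups.
CREATE MATERIALIZED VIEW js.clearing_cache AS
SELECT * FROM js.all_clearing_data;

CREATE UNIQUE INDEX udx_clearing_cache_root
    ON js.clearing_cache (ext_root);

-- Run this on a schedule (cron, pg_cron, etc.); CONCURRENTLY requires
-- the unique index above.
REFRESH MATERIALIZED VIEW CONCURRENTLY js.clearing_cache;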

What we end up with is a container database that holds none of its own data. It’s a resource we can truncate at will, that can crash with no consequences, and that might live anywhere, since it’s nothing but an interface layer. A Postgres instance designed this way becomes a SQL API for accessing assorted data feeds through a standardized execution model. Heck, everyone can have their very own.
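
Giving everyone their very own access might look something like this: a read-only role with usage on the wrapper schemas, plus user mappings so the foreign servers know who’s knocking. The role name and credentials are placeholders:

-- Hypothetical read-only role for consumers of the interface layer.
CREATE ROLE feed_reader LOGIN PASSWORD 'changeme';

GRANT USAGE ON SCHEMA mysql, pgsql, js TO feed_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA mysql, pgsql, js TO feed_reader;

-- Foreign tables also need a user mapping for each querying role,
-- with whatever credentials the remote servers actually accept.
CREATE USER MAPPING FOR feed_reader
  SERVER mysql_market OPTIONS (username 'feed_reader');
CREATE USER MAPPING FOR feed_reader
  SERVER pgsql_market OPTIONS (user 'feed_reader');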

Were we to attempt this with our favorite language or application stack, we would probably need to write our own data joining techniques, implement set theory, and incorporate hashing and sorting algorithms, among other things. In the end, we’d have written a tiny database engine to handle data combination and aggregation, without most of the capabilities of even a mediocre out-of-the-box solution. In this scenario, Postgres bridges that gap for us.

If Postgres has already done all the hard work, why replicate it? And isn’t that what being lazy is all about?

No coding, only nap