PG Phriday: Let There Be Jank

One way the Postgres project is subtly misleading is that it makes it easy to forget that not all other projects are nearly as well managed. This becomes more relevant when delving into niches that lack sufficient visibility to expose the more obvious deficiencies. As much as we like Postgres, it’s not quite as popular as it could be. That leaves some of its side projects infrequently used, and as a direct consequence, they can often resemble jerky automatons cobbled together out of spit and baling wire.

A good example of this is the hdfs_fdw extension for accessing Hadoop. To be fair, a large portion of its complexity stems from its dependencies on Hive and on hastily assembled APIs like Thrift and fb303. Unfortunately it also suffers from “Works for Me” syndrome: there is no autoconf configure script to discover these requirements, nor does it use automake to properly bootstrap the make environment. As a result, most builds will fail outright or require hand-modifying the Makefile, a task many will simply balk at before abandoning the extension entirely.

So, let’s install the hdfs_fdw extension, going through all the necessary steps, and not just some incomplete shorthand that makes assumptions regarding the build system. To facilitate this, no default install locations will be used at all, because hardcoded defaults are how we got into this mess in the first place.

The hdfs_fdw extension depends on Thrift, so let’s start there. As of this writing, the latest version is 0.9.3. Here’s the full build process we used:

wget http://apache.osuosl.org/thrift/0.9.3/thrift-0.9.3.tar.gz
tar -xzf thrift-0.9.3.tar.gz
cd thrift-0.9.3
./configure --prefix=/opt/thrift
make -j2
sudo make install

So far, this is pretty normal. What isn’t normal is that the Thrift source includes a contrib module named fb303 that we also need. Unlike the Postgres build environment, configuration settings and subsequent Makefile components do not cascade to the contrib modules. This is where the trouble begins.

The first issue is that, unlike Thrift, fb303 defaults to static linking only, and won’t generate a dynamic object file unless explicitly told to do so. Considering these components are packaged together, this restriction is unnecessarily jarring. Why choose between dynamic and static linking for a system library that may be used as either? In our case, we want dynamic shared objects, so we need to configure with --disable-static.

The second roadblock comes from the Thrift team itself, which closed a bug as “Cannot Reproduce” despite the fact that one of their include path references is simply wrong. The user who reported the issue even provided a patch. So we need to fix that, too.

The whole process looks like this:

cd contrib/fb303
./bootstrap.sh --prefix=/opt/thrift --with-thriftpath=/opt/thrift \
    --disable-static
sed -i 's%/include/thrift%/include%' cpp/Makefile.am
make -j2
sudo make install

With Thrift out of the way, it’s time to move on to hdfs_fdw itself. Unfortunately we can’t quite do that yet. The hdfs_fdw extension is distributed with a library it depends on, but that library is not accounted for in the Makefile; we need to build and install it separately for some reason. Further, this dependency has hardcoded paths in its own Makefile, so we must modify it, or we’ll end up with the library in an arbitrary location, or with a build that can’t find the required headers.

So let’s build the libhive library:

git clone https://github.com/EnterpriseDB/hdfs_fdw.git
cd hdfs_fdw/libhive
sed -i 's%THRIFT_HOME=.*%THRIFT_HOME=/opt/thrift/include%' Makefile
sed -i 's%INSTALL_DIR=.*%INSTALL_DIR=/opt/thrift/lib%' Makefile
make -j2
sudo make install

We elected to install libhive in the same location as Thrift because they’ll be used together in our case. This prevents cluttering up our /usr/local/lib directory, and it also lets us easily redistribute these prerequisites to our other Postgres systems, since we don’t have a nice package.
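
Since everything now lives under a single prefix, redistribution can be as simple as archiving that directory and unpacking it on the target machine. Here’s a rough sketch; the archive name and target host are just placeholders:

tar -czf thrift-libhive.tar.gz -C /opt thrift
scp thrift-libhive.tar.gz other-host:/tmp/
ssh other-host 'sudo tar -xzf /tmp/thrift-libhive.tar.gz -C /opt'

The target system will still need the ldconfig step covered below before Postgres can load anything from that path.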

Finally, we can build the hdfs_fdw extension itself. Or can we?

Unfortunately, we’re still not done setting up. Because we elected to install Thrift into a non-default location, and because we use dynamic linking, we need to tell the operating system where to find its libraries. Usually this means modifying /etc/ld.so.conf and running ldconfig to re-read the library paths. On Ubuntu and other Debian variants, we can put the entry in a file under /etc/ld.so.conf.d instead, which is less intrusive.

Here’s how that might look:

echo /opt/thrift/lib | sudo tee /etc/ld.so.conf.d/hive.conf
sudo /sbin/ldconfig
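
If we want to confirm the dynamic linker actually picked up the new path, listing the cache should now show the freshly installed libraries. The exact library names may differ slightly between versions:

/sbin/ldconfig -p | grep -E 'thrift|fb303|hive'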

Our particular build environment is an Ubuntu system, which is a Debian variant. As such, build tools like pg_config are actually wrappers around the real utilities, which are hidden away in deep paths to prevent accidental use. This helps facilitate having multiple Postgres versions on the same server, but it also complicates installing extensions, since the wrappers always assume the most recent version. Suppose Postgres 9.6 beta is on our system, but we wanted to install an extension for 9.5?
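
We can see the problem by asking pg_config where it thinks the Postgres binaries live. The wrapper answers for the newest installed version, while the versioned copy buried under /usr/lib/postgresql answers for its own release; the 9.5 path here is illustrative:

pg_config --bindir                               # wrapper: newest version wins
/usr/lib/postgresql/9.5/bin/pg_config --bindir   # explicit 9.5 copy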

That means we need to alter our path before building hdfs_fdw itself. Debian variants put everything we need in /usr/lib/postgresql/[version]/bin, but other UNIX systems may use a different location. We’ll need that information to proceed. In addition, since hdfs_fdw doesn’t use a configure script, we can’t tell it where to find the Thrift and fb303 libraries. This isn’t strictly necessary because we modified ldconfig, but it’s always better to be safe.

So, assuming we’re still in the hdfs_fdw/libhive directory, we’d finish installing hdfs_fdw like this:

export PATH=/usr/lib/postgresql/9.5/bin:$PATH

cd ..
sed -i 's%/usr/local/thrift%/opt/thrift%' Makefile
make -j2 USE_PGXS=1
sudo -E make install

We needed the -E flag to preserve our $PATH variable. Otherwise the root user’s path would be used, and then the extension would be installed into the most recent Postgres version, regardless of our wishes.
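
If we want to verify that the files actually landed in the 9.5 tree rather than a newer one, the extension’s control and SQL files should show up under that version’s share directory. A quick sanity check, with an illustrative path:

ls $(/usr/lib/postgresql/9.5/bin/pg_config --sharedir)/extension/ | grep hdfs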

Since we added the new library references through ldconfig, we also need to restart Postgres. Otherwise, it won’t have /opt/thrift/lib in its library cache, and as a result, it would throw an error when trying to activate hdfs_fdw. Ubuntu systems use pg_ctlcluster for this, while others will use the Postgres pg_ctl tool directly. Let’s make that our final step to “activate” the library before using it.

sudo pg_ctlcluster 9.5 main restart
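
On a system without the Debian cluster tooling, the same restart would go through pg_ctl directly. Something like this, where the data directory is just a placeholder:

sudo -u postgres pg_ctl -D /path/to/data/directory restart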

And finally… finally we’re done. But did it work? Let’s check:

CREATE EXTENSION hdfs_fdw;

CREATE SERVER hdfs_server
         FOREIGN DATA WRAPPER hdfs_fdw
         OPTIONS (host 'hive_host');

CREATE USER MAPPING FOR postgres
    SERVER hdfs_server;

CREATE FOREIGN TABLE hive_sensor_log
(
  id            BIGINT,
  location      VARCHAR(255),
  reading       BIGINT,
  reading_date  TIMESTAMP

) SERVER hdfs_server
OPTIONS (dbname 'default', table_name 'sensor_log');

SELECT * FROM hive_sensor_log;

 id | location | reading |    reading_date     
----+----------+---------+---------------------
  1 | place    |      82 | 2016-06-17 08:15:31
  2 | stuff    |      22 | 2016-06-17 08:18:31

Well then, that was quite an adventure. In the end, we got something that worked, though the number of hoops we had to jump through was a little disconcerting. It shouldn’t have to be this way.

On a personal note, this was actually the easy part. Hadoop is a truculent beast, and other elements in the stack—of which there are many—just make it more ridiculous. Hive itself is probably one of the most janky things I’ve ever encountered. Postgres has libpq, so why isn’t there an equivalent for Hive? Is it a protocol or not? Why do I need to install a freaking one-off Facebook library to access my Hadoop install with a SQL interface?

Worse, I needed to follow multiple incomplete tutorials online to get Hive working at all. Beyond simply installing it, Hive must be started with SASL disabled before hdfs_fdw can talk to it. But doing that means it defaults to Kerberos authentication. If that isn’t set up, commands need to run as the user that launched Hive, since users are mapped from the operating system. To get that to work, I had to modify several more XML files. I feel like I’d need to read at least three books on this subject before even trying to approach this with any amount of confidence.

Then Hive crashed with an OOM error after the query output above. It turns out I could select data to my heart’s content, but following an insert through Hive (which took almost 20 seconds per row), there were ceaseless problems. Pulling data after an insert always caused it to go into an infinite uncaught OOM exception crash loop that required kill -9 to stop. When all of my work for this article was complete, I shut it all down and backed away slowly, lest it rend me to dripping gobbets because I pressed the wrong key in its presence.

Postgres, despite its versatility, just works. It can be used in more advanced architectures and be leveraged for even more power, but it still functions in its base configuration. After today, I sincerely wish more projects followed that philosophy.