PG Phriday: A Postgres Perspective on MongoDB

I’ve been almost exclusively a Postgres DBA for a seemingly interminable length of time. While this is good for specializing, nobody wants to be a One-Trick Pony. And aside from the occasional bit of Python to write more advanced tools when Bash isn’t up to the job, it’s All Postgres All The Time. While few things would make me happier, it pays to branch out occasionally.

When NoSQL databases hit the scene a few years ago, I pretty much ignored them wholesale. I read about what they did and how they did it, and while intriguing, none of that seemed particularly better than what Postgres delivered. Cassandra can scale outwards, but it has no real concept of NULL and limits searches to “indexed” columns. Yet sites like Reddit prove how well it can scale when properly leveraged. MongoDB is basically a native JavaScript filesystem, which I ignored because Postgres support for JSON and JSONB effectively deprecated it before it was even on my radar. There are others too, of course: CouchDB, Redis, HBase, and a plethora of alternatives.

There’s only so much time in a day though, and seeing as we use MongoDB for a couple of projects, it only makes sense to investigate how it really works. What can a Postgres user, who eats, sleeps, and breathes Postgres, hope to learn from a NoSQL database? I guess we can start with some basic operations. Can I make a table, fill it with a million rows, update a few, delete a few, add an extra index for future searches, and perform a query or two?

Let’s start with the Postgres schema I use for practically everything:

CREATE TABLE sensor_log (
  id            SERIAL NOT NULL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);

\timing

INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       CURRENT_DATE - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

Time: 7236.906 ms

CREATE INDEX idx_sensor_log_date
    ON sensor_log (reading_date);

Time: 804.044 ms

Perhaps the most appropriate question to ask at this juncture would be: what is the MongoDB equivalent? Well, as strange as this might sound, there isn’t one. When I described MongoDB as a native JavaScript filesystem, I was being literal. There really are no tables, schemas, or databases as we might think of them. A MongoDB “collection” can be loosely interpreted as a table, in that it stores documents that can be thought of as rows, but that’s where the similarity ends.

A MongoDB collection has no structure at all. Let that sink in a bit. MongoDB collections cannot be declared, and as such, literally anything can be stored in one. Check this out:

use mydb

db.myCol.insert( {animal: "cow", says: "moo"} )
db.myCol.insert( {length: 12, seven: [ 8, 9 ]} )

db.myCol.find()

{ "_id" : ObjectId("5788fc361b81473a2b3d1998"), "animal" : "cow", "says" : "moo" }
{ "_id" : ObjectId("5788fc3a1b81473a2b3d1999"), "length" : 12, "seven" : [ 8, 9 ] }

While it’s in our best interests to keep similar documents restricted to certain collections, there’s no kind of constraint that enforces this. This makes it easy to keep tangentially related objects together, with dozens or hundreds of non-contiguous attributes, so long as there is some common expected core. Yet, we don’t have to. As a consequence, we can’t actually declare a predefined structure, required keys, or any other kind of constraint.

In fact, we can learn quite a lot from that previous code block. First of all, the “mydb” database was never defined. Neither was “myCol”, the collection used to store our bogus documents. Simply storing a document made them exist. It might be better to think of MongoDB databases and collections as root folders and subdirectories in a filesystem. When a document is stored, it triggers the creation of any necessary parent elements. Once instantiated, the database and collection will persist after being emptied unless explicitly dropped.

Another thing that’s hard to miss is the presence of the _id field in our output. All MongoDB documents require one of these, and if we don’t specify one, MongoDB gracefully provides it. Of course, the default MongoDB injects is of limited utility due to its arbitrary nature, but thankfully we can override that behavior.

With all of that in mind, let’s create the equivalent collection and fill it with sample data:

use mydb

var currentDate = new Date();
currentDate.setHours(0,0,0,0)
var batch = new Array()

var start = new Date()
for (var i = 1; i <= 1000000; i++) {
  batch[i] = {
        _id: i,
        location: i % 1000,
        reading: i % 100,
        readingDate: new Date(currentDate - i * 10000)
    }
}

db.sensorLog.insert(batch)

(new Date() - start) / 1000

31.791

start = new Date()
db.sensorLog.ensureIndex( { readingDate: 1 } )
(new Date() - start) / 1000

3.2

We can see that for small data like this, Postgres is a bit more time-efficient for allocating new data. If we examine the filesystem, it’s also apparent this fake sensor data requires 448MB on MongoDB, while Postgres stored everything in 93MB. This is mostly because MongoDB manages storage by preallocating large segments under the assumption the empty space will soon be filled.
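
If we’d rather not poke around the filesystem, the shell can report similar numbers itself. Here’s a quick sketch, with the caveat that the exact fields available depend on the storage engine in use:

// Rough space accounting from within the mongo shell.
db.sensorLog.stats().size          // bytes occupied by the documents themselves
db.sensorLog.stats().storageSize   // bytes allocated to the collection on disk
db.stats().fileSize                // total preallocated data files (MMAPv1 only)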

In any case, we learn even more from all of the above code. One benefit of MongoDB being a native JavaScript engine is that we can directly manipulate objects using standard JavaScript syntax. It’s also convenient that the insert method accepts document arrays, as inserting them one-by-one was orders of magnitude slower. Unfortunately the mongo shell doesn’t provide automatic timing output the way psql does, hence all the manual arithmetic with Date objects.

On the other hand, MongoDB allows us to define whole libraries of local functions that would only be valid for the current session, giving us macro potential Postgres users can only imagine. The necessity of wrapping everything in some kind of SQL-valid statement can often be a hindrance.
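
For instance, nothing stops us from defining a small helper and reusing it for the rest of the session. This is only a sketch, and daysAgo is a name I invented rather than anything MongoDB provides:

// Session-local helper; it vanishes when the shell exits.
function daysAgo(n) {
  var d = new Date();
  d.setHours(0, 0, 0, 0);
  return new Date(d.getTime() - n * 86400000);
}

// Any later query in this session can use it.
db.sensorLog.count({ readingDate: { $gte: daysAgo(7) } })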

Of course, a natural implication of using a JavaScript Engine is that object names are case sensitive. This is a given for nearly any language aside from SQL, but some of us will need to note the context change or we can get ourselves in trouble. This is another good reason to always name Postgres database objects in lowercase and use underscores. We don’t want to learn any inconsistent capitalization habits that may adversely affect us in other systems.
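
As a contrived sketch, these two calls address entirely different collections, because the names differ only in capitalization:

db.sensorLog.count()   // the collection we populated above
db.sensorlog.count()   // a separate collection that, in our case, is simply empty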

Let’s check out a few statements using Postgres and our sensor_log table.

\timing

UPDATE sensor_log
   SET reading = reading + 1
 WHERE reading_date >= CURRENT_DATE - INTERVAL '8 days'
   AND reading_date < CURRENT_DATE - INTERVAL '7 days';

Time: 114.671 ms

DELETE FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '9 days'
   AND reading_date < CURRENT_DATE - INTERVAL '8 days';

Time: 16.248 ms

SELECT COUNT(*) FROM sensor_log;

 count  
--------
 991360
(1 row)

Time: 111.820 ms

SELECT *
  FROM sensor_log
 WHERE reading_date < CURRENT_DATE - INTERVAL '2 week'
 ORDER BY reading_date ASC
 LIMIT 5
OFFSET 20;

   id   | location | reading |    reading_date     
--------+----------+---------+---------------------
 999980 | 980      |      80 | 2016-03-21 06:16:40
 999979 | 979      |      79 | 2016-03-21 06:16:50
 999978 | 978      |      78 | 2016-03-21 06:17:00
 999977 | 977      |      77 | 2016-03-21 06:17:10
 999976 | 976      |      76 | 2016-03-21 06:17:20
(5 rows)

Time: 1.034 ms

Let’s examine each MongoDB equivalent, with the timings in milliseconds. This first one is a translation of our UPDATE statement:

var currentDate = new Date()
currentDate.setHours(0,0,0,0)

start = new Date()
db.sensorLog.update({
    readingDate: {
        $gte: new Date(currentDate.getTime() - 8*86400000),
        $lt: new Date(currentDate.getTime() - 7*86400000)
    }
  },
  { $inc: { reading: 1 } },
  { multi: true }
)
(new Date() - start)

WriteResult({ "nMatched" : 8640, "nUpserted" : 0, "nModified" : 8640 })
77

It’s a bit faster than our Postgres version, but a bit more inconvenient to write. One universal element with MongoDB is the increased complexity of statements. The primary tradeoff for having such convenient access to JavaScript in MongoDB is that everything must be a JavaScript object. Much like Postgres forces us to wrap everything in a SQL-based statement, MongoDB demands that even the simplest query invoke a JavaScript function.

In the case of the update function, the first parameter is the search document, and the second is the list of changes we want to make. It’s here that we encounter constructs like $inc to increment the value of a field. There are several of these specialized operators for manipulating our data.
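
To get a feel for that family, here’s a small hypothetical tweak to a single document using a couple of the other operators, $set and $mul; the values themselves are arbitrary:

// Overwrite one field and double another in the document with _id 10.
db.sensorLog.update(
  { _id: 10 },
  { $set: { location: 999 }, $mul: { reading: 2 } }
)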

The last parameter to update is a series of flags that modify how the update operates. For instance, we enabled multi, which updates all matching rows, overriding the default of updating a single row. That seems fairly nonintuitive to a longtime user of SQL-based systems.

Let’s examine how deletes work:

start = new Date()
db.sensorLog.remove({
    readingDate: {
        $gte: new Date(currentDate.getTime() - 9*86400000),
        $lt: new Date(currentDate.getTime() - 8*86400000)
    }
  }
)
(new Date() - start)

WriteResult({ "nRemoved" : 8640 })
145

While the update was a bit faster than Postgres, the delete was much slower. Beyond that, this is very similar to the update we did earlier. Often in MongoDB, the first parameter to a function will be some kind of query in the form of a JSON object. We should also point out that the default operation of the remove function is to act on all matches, the exact opposite of how update works. There is an argument to only remove the first match, but we didn’t use it here.
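
Had we wanted that behavior, a sketch would look something like this, with justOne restricting the remove to the first matching document:

// Remove only the first document older than (an arbitrary) 90 days.
db.sensorLog.remove(
  { readingDate: { $lt: new Date(currentDate.getTime() - 90*86400000) } },
  { justOne: true }
)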

Let’s keep going and check out how queries themselves work:

start = new Date()
db.sensorLog.count()
(new Date() - start)

2

start = new Date()
db.sensorLog.find({
    readingDate: {
        $lt: new Date(currentDate.getTime() - 14*86400000)
    }
  }
).sort({readingDate: 1}).skip(20).limit(5)
(new Date() - start)

{ "_id" : 999980, "location" : 980, "reading" : 80, "readingDate" : ISODate("2016-03-21T11:16:40Z") }
{ "_id" : 999979, "location" : 979, "reading" : 79, "readingDate" : ISODate("2016-03-21T11:16:50Z") }
{ "_id" : 999978, "location" : 978, "reading" : 78, "readingDate" : ISODate("2016-03-21T11:17:00Z") }
{ "_id" : 999977, "location" : 977, "reading" : 77, "readingDate" : ISODate("2016-03-21T11:17:10Z") }
{ "_id" : 999976, "location" : 976, "reading" : 76, "readingDate" : ISODate("2016-03-21T11:17:20Z") }

2

Obtaining counts is refreshingly simple. One of the cool things about functional approaches like this is that operations are easily chainable. We can either use the count function directly, or tack it onto the end of a longer query to return the total matches. We can see such combinations in the second query with the inclusion of skip and limit, where Postgres would use OFFSET and LIMIT.
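
As a sketch of that chained form, the filter from the query above could report its total matches instead of returning documents:

db.sensorLog.find({
    readingDate: {
        $lt: new Date(currentDate.getTime() - 14*86400000)
    }
  }
).count()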

There are a lot of differences here. Diverse tools are a great resource for varying project requirements. It should be fairly clear that MongoDB is geared toward tight integration into JavaScript projects. The incredibly loose document structure requirements make it a perfect target for storing polymorphic data that consists of highly varying fields or otherwise convoluted structure. Usually we would prefer to abstract such things into a relational schema suitable for a standard database like Postgres. Yet in the case of highly nested object representations, that might not be practical.

Despite this, there is some small resemblance between Postgres and MongoDB. Mongo even provides a mapping to relate similar concepts. While many concepts have a suitable equivalent, others do not. We already mentioned that collections can be treated like tables, but since their structure is dictated by their contents, that’s not strictly true.

Other concepts like JOIN didn’t even exist until the latest MongoDB incarnation, 3.2. Even then, that functionality is only available during aggregation. The only other alternative is to embed nested queries within a forEach loop over a base result set. There’s actually a good reason for that, too. MongoDB is a document store, and as such, one could argue that elements we’re trying to look up are essential attributes of the underlying document. This is a world that operates in the land of denormalization, where doing otherwise is an impediment to the platform.
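
For the curious, a 3.2-style join happens through the $lookup stage of the aggregation pipeline. Here’s a hedged sketch that assumes a hypothetical locations collection keyed by a numeric code field:

// Attach matching "locations" documents to each qualifying sensor reading.
db.sensorLog.aggregate([
  { $match: { reading: 99 } },
  { $lookup: {
      from: "locations",
      localField: "location",
      foreignField: "code",
      as: "locationInfo"
  } }
])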

There’s always a bit of a “not quite” caveat when comparing elements between these two engines. I certainly learned a lot while digging around in a default MongoDB instance. While some features were cool and extremely convenient, others frustrated and confounded me. It’s certainly a different world over there in NoSQL land, yet the need for it is undeniable. This and many other reasons contributed to JSON and JSONB becoming integral parts of Postgres. With them, we get the best of both worlds: loosely structured amorphic documents, and relational logic to cement relationships and enforce at least some integrity within our datasets.

It’s a bit of a foreign concept and technically breaks normalization rules, but I’m willing to consider the potential benefits. Rules are, after all, made to be broken.