I’ve been almost exclusively a Postgres DBA for a seemingly interminable length of time. While this is good for specializing, nobody wants to be a One-Trick Pony. And aside from the occasional bit of Python to write more advanced tools when Bash isn’t up to the job, it’s All Postgres All The Time. While few things would make me happier, it pays to branch out occasionally.

When NoSQL databases hit the scene a few years ago, I pretty much ignored them wholesale. I read about what they did and how they did it, and while intriguing, none of it seemed particularly better than what Postgres delivered. Cassandra can scale outwards, but has no real concept of NULL and limits searches to “indexed” columns. Yet sites like Reddit prove how well it can scale when properly leveraged. MongoDB is basically a native JavaScript filesystem, which I ignored because Postgres support for JSON and JSONB effectively deprecated it before it was even on my radar. There are others, of course: CouchDB, Redis, HBase, and a plethora of alternatives.

There’s only so much time in a day though, and since we use MongoDB for a couple of projects, it only makes sense to investigate how it really works. What can a Postgres user, who eats, sleeps, and breathes Postgres, hope to learn from a NoSQL database? I guess we can start with some simple and basic operations. Can I make a table, fill it with a million rows, update a few, delete a few, add an extra index for future searches, and perform a query or two?

Let’s start with the Postgres schema I use for practically everything:

CREATE TABLE sensor_log (
  id            SERIAL NOT NULL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
\timing
 
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, (s.id % 1000)::TEXT, s.id % 100,
       CURRENT_DATE - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);
 
Time: 7236.906 ms
 
CREATE INDEX idx_sensor_log_date
    ON sensor_log (reading_date);
 
Time: 804.044 ms

Perhaps the most appropriate question to ask at this juncture would be: what is the MongoDB equivalent? Well, as strange as this might sound, there isn’t one. When I described MongoDB as a native JavaScript filesystem, I was being literal. There really are no tables, schemas, or databases as we might think of them. A MongoDB “collection” can be loosely interpreted as a table, in that it stores documents that can be thought of as rows, but that’s where the similarity ends.

A MongoDB collection has no structure at all. Let that sink in a bit. MongoDB collections cannot be declared, and as such, literally anything can be stored in one. Check this out:

use mydb
 
db.myCol.insert( {animal: "cow", says: "moo"} )
db.myCol.insert( {length: 12, seven: [ 8, 9 ]} )
 
db.myCol.find()
 
{ "_id" : ObjectId("5788fc361b81473a2b3d1998"), "animal" : "cow", "says" : "moo" }
{ "_id" : ObjectId("5788fc3a1b81473a2b3d1999"), "length" : 12, "seven" : [ 8, 9 ] }

While it’s in our best interests to keep similar documents restricted to certain collections, no constraint enforces this. That makes it easy to keep tangentially related objects together, with dozens or hundreds of non-contiguous attributes, so long as there is some common expected core. Yet we don’t have to maintain even that. As a consequence, we can’t actually declare a predefined structure, required keys, or any other kind of constraint.

In fact, we can learn quite a lot from that previous code block. First of all, the “mydb” database was never defined. Neither was “myCol”, the collection used to store our bogus documents. Simply storing a document made them exist. It might be better to think of MongoDB databases and collections as root folders and subdirectories in a filesystem. When a document is stored, it triggers the creation of any necessary parent elements. Once instantiated, the database and collection will persist after being emptied unless explicitly dropped.

Another thing that’s hard to miss is the presence of the _id field in our output. All MongoDB documents require one of these, and if we don’t specify one, MongoDB gracefully provides it. Of course, the default MongoDB injects is of limited utility due to its arbitrary nature, but that means we can override that behavior.
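As the comments below point out, the default _id (an ObjectId) isn’t entirely random: per the BSON specification, its first four bytes encode a creation timestamp. A quick sketch for decoding one, written as plain JavaScript so it runs outside the shell too:

```javascript
// The first four bytes (eight hex characters) of an ObjectId are a
// big-endian Unix timestamp marking the document's creation time.
function objectIdTimestamp(hex) {
  var seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// One of the ObjectIds from the find() output above
var created = objectIdTimestamp("5788fc361b81473a2b3d1998");
console.log(created.toISOString());  // a timestamp in July 2016
```

That makes ObjectIds naturally sortable by insertion time, a property our integer override below discards.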

With all of that in mind, let’s create the same table and fill it with sample data:

use mydb
 
var currentDate = new Date();
currentDate.setHours(0,0,0,0)
var batch = new Array()
 
var start = new Date()
for (var i = 1; i <= 1000000; i++) {
  batch.push({
    _id: i,
    location: i % 1000,
    reading: i % 100,
    readingDate: new Date(currentDate - i * 10000)
  })
}
 
db.sensorLog.insert(batch)
 
(new Date() - start) / 1000
 
31.791
 
start = new Date()
db.sensorLog.ensureIndex( { readingDate: 1 } )
(new Date() - start) / 1000
 
3.2

We can see that for small data like this, Postgres is a bit more time-efficient at loading new data. If we examine the filesystem, it’s also apparent this fake sensor data requires 448MB in MongoDB, while Postgres stores everything in 93MB. This is mostly because MongoDB manages storage by preallocating large segments under the assumption the empty space will soon be filled.
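Document storage itself may also contribute, a point raised in the comments: with no catalog, every document carries its own key names, which a relational schema stores exactly once. A rough back-of-envelope sketch (the byte counts are approximate, but BSON really does store each key as a NUL-terminated string):

```javascript
// Estimate the cost of repeating key names in every document, something a
// relational catalog amortizes away. BSON stores each element name as a
// NUL-terminated C string, so each key costs its length plus one byte.
var doc = { _id: 1, location: 1, reading: 1, readingDate: new Date() };
var keyBytes = Object.keys(doc).reduce(function (n, k) {
  return n + k.length + 1;
}, 0);
var totalMB = keyBytes * 1000000 / (1024 * 1024);
console.log(keyBytes + " bytes of key names per document, ~" +
            totalMB.toFixed(1) + " MB across 1M documents");
```

Tens of megabytes of repeated key names is real overhead, though clearly not enough on its own to explain a 448MB footprint.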

In any case, we learn even more from all of the above code. One benefit of MongoDB being a native JavaScript engine is that we can directly manipulate objects using standard JavaScript syntax. It’s also convenient that the insert method accepts document arrays, as inserting them one-by-one was orders of magnitude slower. Unfortunately, the mongo shell doesn’t provide automatic timing output the way psql does, hence all the manual Date arithmetic.

On the other hand, MongoDB allows us to define whole libraries of local functions that would only be valid for the current session, giving us macro potential Postgres users can only imagine. The necessity of wrapping everything in some kind of SQL-valid statement can often be a hindrance.
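As one small illustration, any plain JavaScript function defined in the shell persists for the session. This hypothetical `timed` helper stands in for the `\timing` output psql gives us for free; it runs in any JavaScript engine:

```javascript
// Hypothetical session macro: time any function call in milliseconds,
// replacing the manual start/stop Date bookkeeping used above.
function timed(fn) {
  var start = new Date();
  var result = fn();
  var ms = new Date() - start;   // Date coerces to milliseconds here
  return { ms: ms, result: result };
}
// In the shell (assumed data): timed(function () { return db.sensorLog.count() })
```

Define it once at the top of a session, and every subsequent operation can be timed without boilerplate.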

Of course, a natural implication of using a JavaScript Engine is that object names are case sensitive. This is a given for nearly any language aside from SQL, but some of us will need to note the context change or we can get ourselves in trouble. This is another good reason to always name Postgres database objects in lowercase and use underscores. We don’t want to learn any inconsistent capitalization habits that may adversely affect us in other systems.

Let’s check out a few statements using Postgres and our sensor_log table.

\timing
 
UPDATE sensor_log
   SET reading = reading + 1
 WHERE reading_date >= CURRENT_DATE - INTERVAL '8 days'
   AND reading_date < CURRENT_DATE - INTERVAL '7 days';
 
Time: 114.671 ms
 
DELETE FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '9 days'
   AND reading_date < CURRENT_DATE - INTERVAL '8 days';
 
Time: 16.248 ms
 
SELECT COUNT(*) FROM sensor_log;
 
 count  
--------
 991360
(1 row)
 
Time: 111.820 ms
 
SELECT *
  FROM sensor_log
 WHERE reading_date < CURRENT_DATE - INTERVAL '2 week'
 ORDER BY reading_date ASC
 LIMIT 5
OFFSET 20;
 
   id   | location | reading |    reading_date     
--------+----------+---------+---------------------
 999980 | 980      |      80 | 2016-03-21 06:16:40
 999979 | 979      |      79 | 2016-03-21 06:16:50
 999978 | 978      |      78 | 2016-03-21 06:17:00
 999977 | 977      |      77 | 2016-03-21 06:17:10
 999976 | 976      |      76 | 2016-03-21 06:17:20
(5 rows)
 
Time: 1.034 ms

Let’s examine each MongoDB equivalent, with the timings in milliseconds. This first one is a translation of our UPDATE statement:

var currentDate = new Date()
currentDate.setHours(0,0,0,0)
 
start = new Date()
db.sensorLog.update({
    readingDate: {
        $gte: new Date(currentDate.getTime() - 8*86400000),
        $lt: new Date(currentDate.getTime() - 7*86400000)
    }
  },
  { $inc: { reading: 1 } },
  { multi: true }
)
(new Date() - start)
 
WriteResult({ "nMatched" : 8640, "nUpserted" : 0, "nModified" : 8640 })
77

It’s a bit faster than our Postgres version, but a bit more inconvenient to write. One universal element with MongoDB is the increased complexity of statements. The primary tradeoff for having such convenient access to JavaScript in MongoDB is that everything must be a JavaScript object. Much like Postgres forces us to wrap everything in a SQL-based statement, MongoDB demands that even the simplest query invoke a JavaScript function.

In the case of the update function, the first parameter is the search, and the second is the list of changes we want to make. It’s here that we encounter constructs like $inc to increment the value of a field. There are several of these specialized operators for manipulating our data.

The last parameter to update is a series of flags that modify how the update operates. For instance, we enabled multi, which updates all matching documents, overriding the default of updating a single document. That seems fairly unintuitive to a longtime user of SQL-based systems.
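The `8*86400000` arithmetic in that query converts days to milliseconds, and it is easy to fumble. A small helper (the name `daysWindow` is my own invention) makes such one-day boundaries less error-prone:

```javascript
// Build a { $gte, $lt } window covering the single day that started
// `daysAgo` days before the given anchor date. 86400000 ms = 24 hours;
// this deliberately ignores DST shifts, which is fine for UTC timestamps.
function daysWindow(anchor, daysAgo) {
  var ms = anchor.getTime();
  return {
    $gte: new Date(ms - daysAgo * 86400000),
    $lt: new Date(ms - (daysAgo - 1) * 86400000)
  };
}
// In the shell (assumed data):
// db.sensorLog.update({ readingDate: daysWindow(currentDate, 8) },
//                     { $inc: { reading: 1 } }, { multi: true })
```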

Let’s examine how deletes work:

start = new Date()
db.sensorLog.remove({
    readingDate: {
        $gte: new Date(currentDate.getTime() - 9*86400000),
        $lt: new Date(currentDate.getTime() - 8*86400000)
    }
  }
)
(new Date() - start)
 
WriteResult({ "nRemoved" : 8640 })
145

While the update was a bit faster than Postgres, the delete was much slower. Beyond that, this is very similar to the update we did earlier. Often in MongoDB, the first parameter to a function will be some kind of query in the form of a JSON object. We should also point out that the default operation of the remove function is to act on all matches, the exact opposite of how update works. There is an argument to only remove the first match, but we didn’t use it here.
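The flag for limiting remove to a single match is { justOne: true }. A plain-array analogy captures the asymmetry between the two defaults:

```javascript
// remove(query) deletes every match; remove(query, { justOne: true })
// deletes only the first. An in-memory analogy of the two semantics:
var docs = [ { reading: 1 }, { reading: 2 }, { reading: 2 }, { reading: 3 } ];

// All matches go, like remove({ reading: 2 })
var removeAll = docs.filter(function (d) { return d.reading !== 2; });

// Only the first match goes, like remove({ reading: 2 }, { justOne: true })
var first = docs.findIndex(function (d) { return d.reading === 2; });
var removeOne = docs.slice(0, first).concat(docs.slice(first + 1));
```

Meanwhile update acts on the first match unless told otherwise, so it pays to double-check which default applies before running anything destructive.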

Let’s keep going and check out how queries themselves work:

start = new Date()
db.sensorLog.count()
(new Date() - start)
 
2
 
start = new Date()
db.sensorLog.find({
    readingDate: {
        $lt: new Date(currentDate.getTime() - 14*86400000)
    }
  }
).sort({readingDate: 1}).skip(20).limit(5)
(new Date() - start)
 
{ "_id" : 999980, "location" : 980, "reading" : 80, "readingDate" : ISODate("2016-03-21T11:16:40Z") }
{ "_id" : 999979, "location" : 979, "reading" : 79, "readingDate" : ISODate("2016-03-21T11:16:50Z") }
{ "_id" : 999978, "location" : 978, "reading" : 78, "readingDate" : ISODate("2016-03-21T11:17:00Z") }
{ "_id" : 999977, "location" : 977, "reading" : 77, "readingDate" : ISODate("2016-03-21T11:17:10Z") }
{ "_id" : 999976, "location" : 976, "reading" : 76, "readingDate" : ISODate("2016-03-21T11:17:20Z") }
 
2

Obtaining counts is refreshingly simple. One of the cool things about functional approaches like this is that operations are easily chainable. We can either use the count function directly, or tack it onto the end of a long query to return the total matches. We can see such combinations in the second query with the inclusion of skip and limit, where Postgres would use OFFSET and LIMIT.
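The cursor chaining maps quite cleanly onto ordinary array methods, which is a nice mnemonic for SQL users. A local sketch of `.sort({id: 1}).skip(20).limit(5)` against a plain array:

```javascript
// Mimic MongoDB cursor chaining with array methods: sort/skip/limit
// behaves like sort() followed by slice(offset, offset + limit).
var rows = [];
for (var i = 99; i >= 0; i--) rows.push({ id: i });   // deliberately unsorted

var page = rows
  .slice()                                            // copy; sort() mutates
  .sort(function (a, b) { return a.id - b.id; })      // ORDER BY id ASC
  .slice(20, 25);                                     // OFFSET 20 LIMIT 5

console.log(page.map(function (r) { return r.id; })); // [ 20, 21, 22, 23, 24 ]
```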

There are a lot of differences here. Diverse tools are a great resource for varying project requirements. It should be fairly clear that MongoDB is geared toward tight integration into JavaScript projects. The incredibly loose document structure requirements make it a perfect target for storing polymorphic data that consists of highly varying fields or otherwise convoluted structure. Usually we would prefer to abstract such things into a relational schema suitable for a standard database like Postgres. Yet in the case of highly nested object representations, that might not be practical.

Despite this, there is some small resemblance between Postgres and MongoDB. Mongo even provides a mapping to relate similar concepts. While many concepts have a suitable equivalent, others do not. We already mentioned that collections can be treated like tables, but since their structure is dictated by their contents, that’s not strictly true.

Other concepts, like JOIN, didn’t even exist until the latest MongoDB incarnation, 3.2. Even then, that functionality is only available during aggregation. The only other alternative is to embed nested queries within a forEach loop over a base result set. There’s actually a good reason for that, too. MongoDB is a document store, and as such, one could argue that elements we’re trying to look up are essential attributes of the underlying document. This is a world that operates in the land of denormalization, where doing otherwise is an impediment to the platform.
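For the curious, a 3.2-style aggregation join uses the $lookup stage. This sketch assumes a hypothetical "locations" side collection keyed by the same values as our location field:

```javascript
// Sketch of MongoDB 3.2's aggregation-only join via $lookup. The
// "locations" collection here is hypothetical; sensorLog is the one above.
var pipeline = [
  { $match: { reading: { $gte: 90 } } },   // filter before joining
  { $lookup: {
      from: "locations",                   // hypothetical lookup collection
      localField: "location",
      foreignField: "_id",
      as: "locationDoc"                    // matches land in this array field
  } }
];
// In the shell (assumed data): db.sensorLog.aggregate(pipeline)
```

Note that even this join denormalizes its output: matches arrive embedded as an array inside each base document, not as a flattened row set.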

There’s always a bit of a “not quite” caveat when comparing elements between these two engines. I certainly learned a lot while digging around in a default MongoDB instance. While some features were cool and extremely convenient, others frustrated and confounded me. It’s certainly a different world over there in NoSQL land, yet the need for it is undeniable. This and many other reasons contributed to JSON and JSONB becoming integral parts of Postgres. With them, we get the best of both worlds: loosely structured, amorphous documents, plus relational logic to cement relationships and enforce at least some integrity within our datasets.

It’s a bit of a foreign concept and technically breaks normalization rules, but I’m willing to consider the potential benefits. Rules are, after all, made to be broken.

PG Phriday: A Postgres Perspective on MongoDB

17 thoughts on “PG Phriday: A Postgres Perspective on MongoDB”

  • I would be very interested in you running these numbers again, covering all the engines of Mongo, which operate differently from data access to storage. In fact, one of the engines is pure HEAP while another is LSM. There are four engines used in Percona Server for MongoDB, for example. Each has different storage patterns and requirements that drastically change the disk used and responsiveness of the various CRUD operations. To this point, you didn’t even call out which engine your metrics were on, and I suspect if you used RocksDB from Facebook your update/delete/insert numbers would be massively different 🙂

    1. Huh. I didn’t even know there were other back-ends; I haven’t gotten that far in the documentation yet. That’ll definitely be on my list of things to try out.

  • Nice write up.

    A few notes to consider:

    – 3.2 supports document validation, which helps for restricting inserts and updates: https://docs.mongodb.com/manual/core/document-validation/
    – I would be curious to see if using WiredTiger with compression would help your database size issue (I’m not sure if you were using WT for this article; it’s the default in 3.2)
    – Just like Postgres, querying on indexed fields will be important for performance, even though MongoDB can accept any document structure

  • mongo thoroughly encourages denormalization and embedding of data.

    Try to keep data that belongs together in a single place.

    A classic example may be to have a single document for a blog post, with all the comments as objects in an array. This would represent a 1:many relationship in RDBMSes.

    example illustrating some things that can be done:

    { _id: …,
      date: ISODate('2016 …'),
      title: "PH Phriday….",
      content: "I’ve been almos…",
      comments: [
        { author: "David", date: ISODate('2016 …'), content: "I would be ..",
          replies: [ {shaun’s comment} ] },
        {..more comments}
      ] }

  • Nice write-up! 🙂 Aside from the already mentioned different mongo backends, preallocation strategy (on/off, amounts) can also be modified in the mongod.conf – but as I see it, this is only worth a mention since I assume the postgres setup also runs in its default configuration (?) and is probably out of scope for this article.

    The only real error is regarding the _id: it’s not arbitrary; instead it stores the creation date (only second precision, sadly), a counter, and some other information (iirc) in a hex string. Nice, since it remains sortable and provides a history out of the box. On the phone, otherwise I would post a link for convenience 🙂

    Having said that, since json and jsonb came out, we are primarily using postgres nowadays. While I see remaining use cases for mongo, I find the observation that it was deprecated by these field types pretty spot on – at least for primary application storage.

  • “[MongoDB uses a lot more space than Postgres for the same data] mostly because MongoDB manages storage by preallocating large segments under the assumption the empty space will soon be filled.”

    Er, isn’t it mostly because MongoDB needs to store the key names with each record, while Postgres only needs to store them once with the table?

    1. That might be part of it, but not quite. The MongoDB extents (for the default storage engine at least) are exact powers of two. The three files on disk for this collection are 64MB, 128MB, and 256MB. That’s preallocation behavior if I’ve ever seen it.

      1. The key names are compressed or kept in some kind of index, they aren’t present in full on an individual document.

        That said, it’s definitely preallocation, as you can see quite readily in the documentation. You can customize the exact amount of space to preallocate to several degrees of granularity, even on a per-insert basis.

  • An ObjectId is not arbitrary, but _id is (don’t conflate the two). An _id is an ObjectId by default, but can be absolutely anything.

  • I don’t think anyone is disagreeing with what you are saying here, it’s just that the article stated:

    Of course, the default MongoDB injects is of limited utility due to its arbitrary nature, but that means we can override that behavior.

    Yes, _id can be overridden, but the default behavior is to assign an ObjectId, which isn’t arbitrary.

    1. My point was that, for the purposes of finding documents, it is exactly arbitrary. I don’t care what kind of black-box magic goes into an ID I’m literally never going to use. 🙂
