Friends Don't Let Friends Use Loops
Programming is fun. I love programming! Even after I changed careers from programming to database work, I’ve still occasionally dabbled in my former craft. As such, I believe I can say this with a fair amount of accuracy: programmers don’t understand databases. This isn’t something small, either; there’s a fundamental misunderstanding at play. Unless the coder happens to work primarily with graphics, bulk set-based transformations are not something they’ll generally encounter.
For instance, if tasked with inserting ten thousand records into a database table, a programmer might simply open the data source and insert them one by one. Consider this simple (non-normalized) table with a couple of basic indexes:
CREATE TABLE sensor_log (
    sensor_log_id  SERIAL PRIMARY KEY,
    location       VARCHAR NOT NULL,
    reading        BIGINT NOT NULL,
    reading_date   TIMESTAMP NOT NULL
);
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
Now suppose we have a file with ten thousand lines of something like this:
38c-1401,293857,2014-07-25 10:18:38-05:00
69a-8921,209574,2014-07-25 10:18:25-05:00
9e5-0942,690134,2014-07-25 10:18:16-05:00
To load this data, our coder chooses Python and whips up an insert script. Let’s even give the programmer the benefit of the doubt, and say they know that prepared queries are faster due to reduced parsing and planning overhead. I see scripts like this all the time, written in languages from Java to Erlang. This one is no different:
import psycopg2

db_conn = psycopg2.connect(database='postgres', user='postgres')
cur = db_conn.cursor()

# Prepare the INSERT once so each row only pays execution cost.
cur.execute(
    """PREPARE log_add AS
       INSERT INTO sensor_log (location, reading, reading_date)
       VALUES ($1, $2, $3);"""
)

# Loop through the file and issue one INSERT per row.
file_input = open('/tmp/input.csv', 'r')

for line in file_input:
    cur.execute("EXECUTE log_add(%s, %s, %s)", line.strip().split(','))

file_input.close()

cur.execute("DEALLOCATE log_add")

db_conn.commit()
db_conn.close()
It’s unlikely we have the /tmp/input.csv file itself, but we can generate one. Suppose we have 100 locations, each with 100 sensors. We could produce a fake input file with this SQL:
COPY (
    SELECT substring(md5((a.id % 100)::TEXT), 1, 3) || '-' ||
           to_char(a.id % 100, 'FM0999') AS location,
           (a.id * random() * 1000)::INT AS reading,
           now() - (a.id % 60 || 's')::INTERVAL AS reading_date
      FROM generate_series(1, 10000) a (id)
) TO '/tmp/input.csv' WITH CSV;
Whew! That was a lot of work. Now, let’s see what happens when we time the inserts into the empty sensor_log table:
real 0m1.162s
user 0m0.168s
sys 0m0.122s
Well, a little over one second isn’t that bad. But suppose we rewrote the Python script a bit. Bear with me; I’m going to be silly and use the Python script as a simple pass-through. This should simulate a process that applies transformations and outputs another file for direct database import. Here we go:
import psycopg2

# Pass-through "transformation": read each line and write it back out.
file_input = open('/tmp/input.csv', 'r')
processed = open('/tmp/processed.csv', 'w+')

for line in file_input:
    parts = line.strip().split(',')
    processed.write(','.join(parts) + '\n')

file_input.close()

# Rewind the processed file so COPY can read it from the beginning.
processed.seek(0)

db_conn = psycopg2.connect(database='postgres', user='postgres')
cur = db_conn.cursor()

cur.copy_from(
    processed, 'sensor_log', ',',
    columns=('location', 'reading', 'reading_date')
)

processed.close()

db_conn.commit()
db_conn.close()
Now let’s look at the timings involved again:
real 0m0.365s
user 0m0.056s
sys 0m0.004s
That’s about three times faster! Considering how simple this example is, that’s actually pretty drastic. We don’t have many indexes, the table has few columns, and the number of rows is relatively small. The situation gets far worse as all of those things increase.
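As an aside, the intermediate file isn’t strictly necessary. psycopg2’s copy_from accepts any file-like object, so the transformed rows can be streamed from an in-memory buffer instead. This is only a sketch of that idea, not part of the benchmark above; the io.StringIO buffer is my own addition:

import io
import psycopg2

# Collect the transformed rows in memory rather than a second file.
# (Sketch only: io.StringIO stands in for the temporary file.)
buffer = io.StringIO()

with open('/tmp/input.csv', 'r') as file_input:
    for line in file_input:
        parts = line.strip().split(',')
        buffer.write(','.join(parts) + '\n')

buffer.seek(0)

db_conn = psycopg2.connect(database='postgres', user='postgres')
cur = db_conn.cursor()

# copy_from reads the buffer exactly as it would read a file on disk.
cur.copy_from(
    buffer, 'sensor_log', ',',
    columns=('location', 'reading', 'reading_date')
)

db_conn.commit()
db_conn.close()

Either way, the database sees a single COPY instead of ten thousand individual inserts, and that’s what matters.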
It’s also not the end of our story. What happens if we enable autocommit, so that each insert gets its own transaction? Some ORMs might do this, or a naive developer might generate a single script full of INSERT statements without knowing much about transactions.
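For reference, this is roughly what that variant looks like, assuming we simply flip psycopg2’s autocommit switch on the original loop script; consider it a sketch rather than the exact code behind these numbers:

import psycopg2

db_conn = psycopg2.connect(database='postgres', user='postgres')
db_conn.autocommit = True  # every INSERT now commits in its own transaction

cur = db_conn.cursor()

cur.execute(
    """PREPARE log_add AS
       INSERT INTO sensor_log (location, reading, reading_date)
       VALUES ($1, $2, $3);"""
)

file_input = open('/tmp/input.csv', 'r')

for line in file_input:
    cur.execute("EXECUTE log_add(%s, %s, %s)", line.strip().split(','))

file_input.close()

cur.execute("DEALLOCATE log_add")
db_conn.close()

Let’s see what the timings say: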
real 1m31.794s
user 0m0.233s
sys 0m0.172s
Oh. Ouch. What normally takes around a third of a second can balloon all the way out to a minute and a half. This is one of the reasons I strongly advocate educating developers on proper data import techniques. It’s one thing for a job to be three to five times slower due to inefficient design. It’s quite another to be nearly 250 times slower simply because a programmer believes producing the output file was fast, so logically, inserting it should be similarly speedy. Both scenarios can be avoided by educating anyone involved with data manipulation.
This doesn’t just apply to new hires. Keeping everyone up to date on new techniques is equally important, as are refresher courses. I care about my database, so when possible, I try to over-share as much information as I can manage. I’ve even written several talks that I periodically give to our application developers to encourage better database use. Our company Wiki is similarly full of information, which I also present on occasion, if only because reading technical manuals can be quite boring.
If my database is being abused, it’s my job as a DBA to try to alleviate the situation in any way I can. Sometimes, that means telling people what they’re doing wrong, and how they can fix it. I certainly didn’t know all of this ten years ago when I was primarily a coder. But I would have appreciated being pointed in the right direction by someone with more experience in the field.
Your database and users deserve it.