Friends Don't Let Friends Use Loops
Programming is fun. I love programming! Even after I changed careers from programming to database work, I’ve still occasionally dabbled in my former craft. As such, I believe I can say this with a fair amount of accuracy: programmers don’t understand databases. This isn’t something small, either; there’s a fundamental misunderstanding at play. Unless the coder happens to work primarily with graphics, bulk set-based transformations are not something they’ll generally encounter.
For instance, if tasked with inserting ten thousand records into a database table, a programmer might simply open the data source and insert them one by one. Consider this simple (non-normalized) table with a couple of basic indexes:
CREATE TABLE sensor_log (
    sensor_log_id  SERIAL PRIMARY KEY,
    location       VARCHAR NOT NULL,
    reading        BIGINT NOT NULL,
    reading_date   TIMESTAMP NOT NULL
);
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
Now suppose we have a file with ten thousand lines of something like this:
38c-1401,293857,2014-07-25 10:18:38-05:00
69a-8921,209574,2014-07-25 10:18:25-05:00
9e5-0942,690134,2014-07-25 10:18:16-05:00
To load this data, our coder chooses Python and whips up an insert script. Let’s even give the programmer the benefit of the doubt, and say they know that prepared queries are faster due to reduced parsing and planning overhead. I see scripts like this all the time, written in languages from Java to Erlang. This one is no different:
import psycopg2

db_conn = psycopg2.connect(database='postgres', user='postgres')
cur = db_conn.cursor()

# Prepare the INSERT once so each row only pays execution cost.
cur.execute(
    """PREPARE log_add AS
       INSERT INTO sensor_log (location, reading, reading_date)
       VALUES ($1, $2, $3);"""
)

# Loop through the file and issue one INSERT per row.
file_input = open('/tmp/input.csv', 'r')

for line in file_input:
    cur.execute("EXECUTE log_add(%s, %s, %s)", line.strip().split(','))

file_input.close()

cur.execute("DEALLOCATE log_add")

db_conn.commit()
db_conn.close()
It’s unlikely we have the /tmp/input.csv file itself, but we can generate one. Suppose we have 100 locations, each with 100 sensors. We could produce a fake input file with this SQL:
COPY (
    SELECT substring(md5((a.id % 100)::TEXT), 1, 3) || '-' ||
           to_char(a.id % 100, 'FM0999') AS location,
           (a.id * random() * 1000)::INT AS reading,
           now() - (a.id % 60 || 's')::INTERVAL AS reading_date
      FROM generate_series(1, 10000) a (id)
) TO '/tmp/input.csv' WITH CSV;
Whew! That was a lot of work. Now, let’s see what happens when we time the inserts into the empty sensor_log table:
real 0m1.162s
user 0m0.168s
sys 0m0.122s
Well, a little over one second isn’t that bad. But suppose we rewrote the Python script a bit. Bear with me; I’m going to be silly and use the Python script as a simple pass-through. This should simulate a process that applies transformations and outputs another file for direct database import. Here we go:
import psycopg2

# Pass-through "transformation": read each line and write it back out.
file_input = open('/tmp/input.csv', 'r')
processed = open('/tmp/processed.csv', 'w+')

for line in file_input:
    parts = line.strip().split(',')
    processed.write(','.join(parts) + '\n')

file_input.close()

# Rewind the processed file so COPY can read it from the beginning.
processed.seek(0)

db_conn = psycopg2.connect(database='postgres', user='postgres')
cur = db_conn.cursor()

cur.copy_from(
    processed, 'sensor_log', ',',
    columns=('location', 'reading', 'reading_date')
)

processed.close()

db_conn.commit()
db_conn.close()
Now let’s look at the timings involved again:
real 0m0.365s
user 0m0.056s
sys 0m0.004s
That’s about three times faster! Considering how simple this example is, that’s actually pretty drastic. We don’t have many indexes, the table has few columns, and the number of rows is relatively small. The situation gets far worse as all of those things increase.
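As an aside, the intermediate file isn’t strictly necessary. psycopg2’s copy_from accepts any file-like object, so the transformed rows can be streamed from an in-memory buffer instead. This is only a sketch of that idea, not part of the benchmark above; the io.StringIO buffer is my own addition:

import io
import psycopg2

# Collect the transformed rows in memory rather than a second file.
# (Sketch only: io.StringIO stands in for the temporary file.)
buffer = io.StringIO()

with open('/tmp/input.csv', 'r') as file_input:
    for line in file_input:
        parts = line.strip().split(',')
        buffer.write(','.join(parts) + '\n')

buffer.seek(0)

db_conn = psycopg2.connect(database='postgres', user='postgres')
cur = db_conn.cursor()

# copy_from reads the buffer exactly as it would read a file on disk.
cur.copy_from(
    buffer, 'sensor_log', ',',
    columns=('location', 'reading', 'reading_date')
)

db_conn.commit()
db_conn.close()

Either way, the database sees a single COPY instead of ten thousand individual inserts, and that’s what matters.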
It’s also not the end of our story. What happens if we enable autocommit, so that each insert gets its own transaction? Some ORMs might do this, or a naive developer might generate a single script full of INSERT statements without knowing much about transactions.
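For reference, this is roughly what that variant looks like, assuming we simply flip psycopg2’s autocommit switch on the original loop script; consider it a sketch rather than the exact code behind these numbers:

import psycopg2

db_conn = psycopg2.connect(database='postgres', user='postgres')
db_conn.autocommit = True  # every INSERT now commits in its own transaction

cur = db_conn.cursor()

cur.execute(
    """PREPARE log_add AS
       INSERT INTO sensor_log (location, reading, reading_date)
       VALUES ($1, $2, $3);"""
)

file_input = open('/tmp/input.csv', 'r')

for line in file_input:
    cur.execute("EXECUTE log_add(%s, %s, %s)", line.strip().split(','))

file_input.close()

cur.execute("DEALLOCATE log_add")
db_conn.close()

Let’s see what the timings say: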
real 1m31.794s
user 0m0.233s
sys 0m0.172s
Oh. Ouch. What normally takes around a third of a second can balloon all the way out to a minute and a half. This is one of the reasons I strongly advocate educating developers on proper data import techniques. It’s one thing for a job to be three to five times slower due to inefficient design. It’s quite another to be nearly 250 times slower simply because a programmer believes producing the output file was fast, so logically, inserting it should be similarly speedy. Both scenarios can be avoided by educating anyone involved with data manipulation.
This doesn’t just apply to new hires. Keeping everyone up to date on new techniques is equally important, as are refresher courses. I care about my database, so when possible, I try to over-share as much information as I can manage. I’ve even written several talks that I periodically give to our application developers to encourage better database use. Our company Wiki is similarly full of information, which I also present on occasion, if only because reading technical manuals can be quite boring.
If my database is being abused, it’s my job as a DBA to try to alleviate the situation in any way I can. Sometimes, that means telling people what they’re doing wrong, and how they can fix it. I certainly didn’t know all of this ten years ago when I was primarily a coder. But I would have appreciated being pointed in the right direction by someone with more experience in the field.
Your database and users deserve it.