Friends Don’t Let Friends Use Loops

July 25th, 2014 | Published in Database, Programming, Tech Talk | 14 Comments


Programming is fun. I love programming! Even after changing careers from programming to database work, I still dabble in my former craft from time to time. As such, I believe I can say this with a fair amount of accuracy: programmers don’t understand databases. This isn’t something small, either; there’s a fundamental misunderstanding at play. Unless a coder happens to work primarily with graphics, bulk set-based transformations simply aren’t something they deal with regularly.

For instance, if tasked with inserting ten thousand records into a database table, a programmer might simply open the data source and insert the rows one by one. Consider this basic (non-normalized) table with a couple of basic indexes:

CREATE TABLE sensor_log (
  sensor_log_id    SERIAL PRIMARY KEY,
  location         VARCHAR NOT NULL,
  reading          BIGINT NOT NULL,
  reading_date     TIMESTAMP NOT NULL
);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

Now suppose we have a file with ten thousand lines of something like this:

38c-1401,293857,2014-07-25 10:18:38-05:00
69a-8921,209574,2014-07-25 10:18:25-05:00
9e5-0942,690134,2014-07-25 10:18:16-05:00

To load this data, our coder chooses Python and whips up an insert script. Let’s even give the programmer the benefit of the doubt, and say they know that prepared queries are faster due to less overhead. I see scripts like this all the time, written in languages from Java to Erlang. This one is no different:

import psycopg2

db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
cur = db_conn.cursor()

# Prepare the INSERT once so each execution skips the parse and plan steps.
cur.execute(
  """PREPARE log_add AS
     INSERT INTO sensor_log (location, reading, reading_date)
     VALUES ($1, $2, $3);"""
)

# Read the file and insert it one row at a time: a classic client-side loop.
file_input = open('/tmp/input.csv', 'r')
for line in file_input:
    cur.execute("EXECUTE log_add(%s, %s, %s)", line.strip().split(','))
file_input.close()

cur.execute("DEALLOCATE log_add")

db_conn.commit()
db_conn.close()

It’s unlikely we have the /tmp/input.csv file itself, but we can generate one. Suppose we have 100 locations each with 100 sensors. We could produce a fake input file with this SQL:

COPY (
    SELECT substring(md5((a.id % 100)::TEXT), 1, 3) || '-' ||
           to_char(a.id % 100, 'FM0999') AS location,
           (a.id * random() * 1000)::INT AS reading,
           now() - (a.id % 60 || 's')::INTERVAL AS reading_date
      FROM generate_series(1, 10000) a (id)
) TO '/tmp/input.csv' WITH CSV;

Whew! That was a lot of work. Now, let’s see what happens when we time the inserts on an empty import table:

real    0m1.162s
user    0m0.168s
sys     0m0.122s

Well, a little over one second isn’t that bad. But suppose we rewrote the Python script a bit. Bear with me; I’m going to be silly and use the script as a simple pass-through. This should simulate a process that applies transformations and outputs another file for direct database import. Here we go:

import psycopg2

file_input = open('/tmp/input.csv', 'r')
processed = open('/tmp/processed.csv', 'w+')

# Stand-in for a real transformation step: read each row, write it back out.
for line in file_input:
    parts = line.strip().split(',')
    processed.write(','.join(parts) + '\n')

file_input.close()
processed.seek(0)

# Hand the whole processed file to PostgreSQL in a single COPY operation.
db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
cur = db_conn.cursor()
cur.copy_from(
    processed, 'sensor_log', ',',
    columns = ('location', 'reading', 'reading_date')
)

processed.close()

db_conn.commit()
db_conn.close()

Now let’s look at the timings involved again:

real    0m0.365s
user    0m0.056s
sys     0m0.004s

That’s about three times faster! Considering how simple this example is, that’s actually pretty drastic. We don’t have many indexes, the table has few columns, and the number of rows is relatively small. The situation gets far worse as all of those things increase.

It’s also not the end of our story. What happens if we enable autocommit, so that each insert gets its own transaction? Some ORMs might do this, and a naive developer who doesn’t know much about transactions might generate a single script full of insert statements.
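
For reference, the only change needed is a single attribute on the connection. Here’s a minimal sketch of that variation, assuming psycopg2, whose autocommit setting gives every statement its own transaction:

import psycopg2

db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
db_conn.autocommit = True  # every statement now commits individually

cur = db_conn.cursor()

# Same prepared statement and row-by-row loop as before.
cur.execute(
  """PREPARE log_add AS
     INSERT INTO sensor_log (location, reading, reading_date)
     VALUES ($1, $2, $3);"""
)

file_input = open('/tmp/input.csv', 'r')
for line in file_input:
    cur.execute("EXECUTE log_add(%s, %s, %s)", line.strip().split(','))
file_input.close()

cur.execute("DEALLOCATE log_add")
db_conn.close()

Every row now pays for its own commit, which with default settings means a WAL flush to disk for each insert. Let’s see what that does to the timings: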

real    1m31.794s
user    0m0.233s
sys     0m0.172s

Oh. Ouch. What normally takes around a third of a second can balloon all the way out to a minute and a half. This is one of the reasons I strongly advocate educating developers on proper data import techniques. It’s one thing for a job to be three to five times slower due to inefficient design. It’s quite another to be nearly 250 times slower simply because a programmer believes producing the output file was fast, so logically, inserting it should be similarly speedy. Both scenarios can be avoided by educating anyone involved with data manipulation.

This doesn’t just apply to new hires. Keeping everyone up to date on new techniques is equally important, as are refresher courses. I care about my database, so when possible, I try to over-share as much information as I can manage. I even wrote several talks that I periodically give to our application developers to encourage better database use. Our company Wiki is similarly full of information, which I also present on occasion, if only because reading technical manuals can be quite boring.

If my database is being abused, it’s my job as a DBA to try to alleviate the situation any way I can. Sometimes that means telling people what they’re doing wrong, and how they can fix it. I certainly didn’t know all of this ten years ago when I was primarily a coder. But I would have appreciated being pointed in the right direction by someone with more experience in the field.

Your database and users deserve it.



14 Comments

  1. Adrian Klaver says:

    July 25th, 2014 at 7:33 pm


    Should this: “What happens if we disable autocommit, so that each insert gets its own transaction?”

    not be:

    What happens if we enable autocommit, so that each insert gets its own transaction?


    1. Shaun says:

      July 26th, 2014 at 11:25 am


      Oops! Fixed.


  2. Darren Duncan says:

    July 26th, 2014 at 2:36 am


    Typo: Don’t you mean to say “What happens if we ENable autocommit, so that each insert gets its own transaction?”


  3. PostgreSQL anti-patterns: read-modify-write cycles says:

    July 26th, 2014 at 4:06 am


    [...] Shaun Thomas’s recent post about client-side loops as an SQL anti-pattern is well worth a read if you’re relatively new to SQL-based application development. [...]


  4. John Fawcett says:

    July 30th, 2014 at 2:29 pm


    Nice! I wrote a little bit about Postgres Copy in node.js a while back http://blog.j0.hn/post/80266933245/stream-all-the-things-like-data-into-postgres I should have put some performance comparisons in there though :)

    It might be worth noting that all data imported from COPY is atomic and if one part of the stream fails, then none of the data is imported.


    1. Shaun says:

      July 30th, 2014 at 2:37 pm


      True regarding the atomic action. There are ways to get around that, though. Primarily, staging loads into temporary tables can solve several issues. This is especially true if loading into extremely large tables or tables with a lot of foreign keys or triggers that can drastically increase insert time. Bulk process into the staging table, then perform an atomic commit at the end.
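
      A rough sketch of that staging pattern (assuming psycopg2 and a throwaway temp table; the names here are illustrative only):

      import psycopg2

      db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
      cur = db_conn.cursor()

      # Stage the raw rows in a session-local table, so sensor_log and its
      # indexes, triggers, and foreign keys aren't touched yet.
      cur.execute(
          """CREATE TEMP TABLE sensor_log_stage (
               location      VARCHAR NOT NULL,
               reading       BIGINT NOT NULL,
               reading_date  TIMESTAMP NOT NULL
             )"""
      )

      with open('/tmp/processed.csv', 'r') as processed:
          cur.copy_from(processed, 'sensor_log_stage', ',')

      # Validate or transform in bulk here, then move everything across in
      # one set-based statement and commit atomically at the end.
      cur.execute(
          """INSERT INTO sensor_log (location, reading, reading_date)
             SELECT location, reading, reading_date
               FROM sensor_log_stage"""
      )

      db_conn.commit()
      db_conn.close()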

      Maybe I’ll cover that in a subsequent post.


  5. Anko says:

    July 30th, 2014 at 9:17 pm


    I didn’t know about that API for database access before, so thanks for that.

    But please don’t ever parse a csv using split. Use a csv library! I could break your code with a simple \, in the data.

    It really hurts that your premise is that programmers don’t understand databases, and then you go on to show that DBAs don’t understand programming. If the database has an api which is basically a leaky abstraction, that’s a problem with the database, not the programmer.


    1. Shaun says:

      July 30th, 2014 at 11:03 pm


      Anko, if you take these short examples as some kind of statement regarding my coding abilities or background, you’ve missed the entire point of this article. Unless you’ve worked in the industry and helped a developer transform an hour-long insert loop into a five minute bulk load, you have no idea what you’re talking about. My goal wasn’t to attack coder abilities, but to advocate DBAs spread database-centric knowledge to the developers they work with every day.

      It’s unfortunate you chose to take the more pessimistic interpretation.


      1. -dg says:

        August 5th, 2014 at 4:36 pm


        … if you take these short examples as some kind of statement regarding my coding abilities or background, you’ve missed the entire point of this article…

        Shaun, instead of defensively trying to parry Anko’s valid critique of your example wouldn’t it be better to simply fix the example? You could use the python csv module or instead of csv use tsv with eg: ‘\t’.join(fields) (assuming no embedded tabs in the data).

        When your topic is “you people are doing it wrong, here is the right way,” you owe it to your readers to actually do and show things the right way.


        1. Shaun says:

          August 5th, 2014 at 5:11 pm


          Because it’s still not the point. Yes, I could have used the CSV module, but something tells me it has more overhead than what amounts to an effective no-op.

          In reality, I don’t suggest writing a CSV parser by hand in any case. There are ETL tools like talend for that. But I encounter problems like this often enough, I had to say something.


  6. Scott says:

    July 31st, 2014 at 11:10 am


    This article is really good. Thanks!

    I have a question about doing this with Java. I have seen bulk file load options in some JDBC drivers, but it is very vendor specific. Do you know if there is a generic way to do this kind of task?


  7. Jaap Aap says:

    July 31st, 2014 at 6:32 pm


    I think the title of this article should read “Friends Don’t Let Friends Use Relational Databases”


    1. Shaun says:

      July 31st, 2014 at 9:14 pm


      Well, relational data structures are based on the same set theories that make graphics work, but you know what? You’re absolutely right! Down with faster set-based operations! Who needs to avoid duplicating data? Have a million things that use a name you just changed? Screw it! Just change the million things instead of the common name they use! Efficiency!


  8. miniminiyo says:

    August 1st, 2014 at 7:41 am


    That’s right, DB connections and usage should be handled carefully, not with naive methods, because databases only ever grow… and it’s going to be a problem if you leave performance to lazy methods.

