Foreign Keys are Not Free

May 14th, 2014 | Published in Database, Tech Talk


PostgreSQL is a pretty good database, and I enjoy working with it. However, there is an implementation detail that not everyone knows about, which can drastically affect table performance. What is this mysterious feature? I am, of course, referring to foreign keys.

Foreign keys are normally a part of good database design, and for good reason: they document entity relationships, and they verify, enforce, and maintain those relationships. Yet all of this comes at a cost that might surprise you. In PostgreSQL, every foreign key is maintained with an invisible system-level trigger added to the source table of the reference (the table being referenced). At least one trigger must go there, because any operation that modifies the source data must be checked to ensure it does not violate the constraint.
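
You can see these triggers for yourself. As a rough sketch (the table name my_table is just a placeholder for one of your own tables), pg_trigger marks them as internal:

-- List the internal constraint triggers PostgreSQL maintains for
-- foreign keys involving the hypothetical table "my_table".
SELECT tgname, tgrelid::regclass AS table_name
  FROM pg_trigger
 WHERE tgrelid = 'my_table'::regclass
   AND tgisinternal;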

This query is an easy way to see how many foreign keys reference each table in a PostgreSQL database:

SELECT t.oid::regclass::text AS table_name, count(1) AS total
  FROM pg_constraint c
  JOIN pg_class t ON (t.oid = c.confrelid)
 GROUP BY table_name
 ORDER BY total DESC;
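
For the opposite view, something like this counts the foreign keys declared on each table itself, by joining on conrelid instead of confrelid (the contype filter keeps other constraint types out of the count):

-- Count the foreign keys defined on each table (the referencing side).
SELECT t.oid::regclass::text AS table_name, count(1) AS total
  FROM pg_constraint c
  JOIN pg_class t ON (t.oid = c.conrelid)
 WHERE c.contype = 'f'
 GROUP BY table_name
 ORDER BY total DESC;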

With this in mind, how much overhead does each of these triggers impose on the referenced table? We can actually measure it. Consider this function:

CREATE OR REPLACE FUNCTION fnc_check_fk_overhead(key_count INT)
RETURNS VOID AS
$$
DECLARE
  i INT;
BEGIN
  CREATE TABLE test_fk
  (
    id   BIGINT PRIMARY KEY,
    junk VARCHAR
  );

  INSERT INTO test_fk
  SELECT generate_series(1, 100000), repeat(' ', 20);

  CLUSTER test_fk_pkey ON test_fk;

  FOR i IN 1..key_count LOOP
    EXECUTE 'CREATE TABLE test_fk_ref_' || i || 
            ' (test_fk_id BIGINT REFERENCES test_fk (id))';
  END LOOP;

  FOR i IN 1..100000 LOOP
    UPDATE test_fk SET junk = '                    '
     WHERE id = i;
  END LOOP;

  DROP TABLE test_fk CASCADE;

  FOR i IN 1..key_count LOOP
    EXECUTE 'DROP TABLE test_fk_ref_' || i;
  END LOOP;

END;
$$ LANGUAGE plpgsql VOLATILE;

The function is designed to create a simple two-column table, fill it with 100,000 records, and test how long it takes to update every record. This is purely meant to simulate a high-transaction load caused by multiple clients. I know no sane developer would actually update so many records this way.

The only parameter this function accepts is the number of tables it should create that reference the source table. Every referring table is empty and has just the one column needed for the reference to be valid. After the foreign key tables are created, the function performs those 100,000 updates, and we can time the whole thing with our favorite SQL tool. Here is a quick test with psql:

\timing
SELECT fnc_check_fk_overhead(0);
SELECT fnc_check_fk_overhead(5);
SELECT fnc_check_fk_overhead(10);
SELECT fnc_check_fk_overhead(15);
SELECT fnc_check_fk_overhead(20);

On our system, each of these calls was run three times, and the timings averaged 2961 ms, 3805 ms, 4606 ms, 5089 ms, and 5785 ms respectively. As we can see, with merely five foreign keys, the updates already take 28.5% longer (3805 ms vs. 2961 ms). By the time we have 20 foreign keys, the updates are 95% slower (5785 ms vs. 2961 ms)!

I don’t mention this to make you abandon foreign keys. However, if you are in charge of an extremely active OLTP system, you might consider removing any non-critical FK constraints. If the values are merely informative, or will not cause any integrity concerns, a foreign key is not required. Indeed, excessive foreign keys are actually detrimental to the database in a very tangible way.
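
Dropping a constraint you have deemed non-critical is a one-liner. As a sketch, with placeholder table and constraint names (psql's \d will show you the real constraint name):

-- The table and constraint names here are only placeholders.
ALTER TABLE child_table DROP CONSTRAINT child_table_parent_id_fkey;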

I merely ask that you keep this in mind when designing or revising schemas for yourself or the developers you support.



6 Comments

  1. Jason says:

    May 14th, 2014 at 5:25 pm


    Maybe it’s just me, but “the updates are 95% slower” sounds like 5% throughput compared to the original. But it’s more like 50% throughput.

    In any case, interesting post.


  2. xor says:

    May 14th, 2014 at 11:41 pm


    I think to test only UPDATE performance, you need to exclude the CREATE and DROP of the reference tables from this test! Creating 5 tables is not the same as creating 10, 15, or 20 tables :)


    1. Shaun says:

      May 15th, 2014 at 8:06 am


      Modifying the function isn’t that difficult. Here’s a version that does just as you suggest:

      CREATE OR REPLACE FUNCTION fnc_check_fk_overhead(key_count INT)
      RETURNS INTERVAL AS
      $$
      DECLARE
        i INT;
        start_time TIMESTAMP;
        end_time TIMESTAMP;
      BEGIN
        CREATE TABLE test_fk
        (
          id   BIGINT PRIMARY KEY,
          junk VARCHAR
        );
      
        INSERT INTO test_fk
        SELECT generate_series(1, 100000), repeat(' ', 20);
      
        CLUSTER test_fk_pkey ON test_fk;
      
        FOR i IN 1..key_count LOOP
          EXECUTE 'CREATE TABLE test_fk_ref_' || i || 
                  ' (test_fk_id BIGINT REFERENCES test_fk (id) ON UPDATE NO ACTION)';
        END LOOP;
      
        start_time = clock_timestamp();
      
        FOR i IN 1..100000 LOOP
          UPDATE test_fk SET junk = '                    '
           WHERE id = i;
        END LOOP;
      
        end_time = clock_timestamp();
      
        DROP TABLE test_fk CASCADE;
      
        FOR i IN 1..key_count LOOP
          EXECUTE 'DROP TABLE test_fk_ref_' || i;
        END LOOP;
      
        RETURN end_time - start_time;
      
      END;
      $$ LANGUAGE plpgsql VOLATILE;
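
      Since this version returns the measured interval directly, \timing is no longer needed; a comparable run might look like this:

      SELECT fnc_check_fk_overhead(0);
      SELECT fnc_check_fk_overhead(5);
      SELECT fnc_check_fk_overhead(20);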
      

      The results are similar. It’s very clear there is quite a bit of overhead associated with foreign key triggers.


  3. Luca Veronese says:

    May 15th, 2014 at 11:08 am


    I can confirm the performance penalty myself from tests on my own use cases. I have real cases where foreign key checks account for 60% of total execution time. The implementation via triggers is probably the culprit, since the check seems to be done one row at a time. Postgres should collect the updated rows and process the constraint in a set-oriented manner, using low-level index accesses instead of recursive SQL calls (which is what it seems to do, given the performance penalty). Disclaimer: I did not read the source code and may be wrong. Anyway, this is an area that needs improvement (more than JSON support, IMHO).


    1. Luca Veronese says:

      May 15th, 2014 at 11:20 am


      My post above was about INSERT performance on tables that reference many other tables. I'm sorry, your test was very different. What I find really strange is that you are not updating the primary key of test_fk, so you should not see any side effect from the presence of the foreign keys. If your findings are confirmed, then I think a patch should be made to check whether updates on the master table actually touch the primary key columns (or any unique key referenced by foreign key constraints, for that matter). Yet I find it really strange that the PG developers did not catch this big inefficiency before.


      1. Shaun says:

        May 15th, 2014 at 2:01 pm


        I didn’t quite get this either. I’ve seen problems with cascaded updates or deletes due to missing indexes, but this is something entirely different and completely unexpected. Key updates are extremely unlikely in environments that utilize surrogate keys, so adding this kind of overhead to every trigger call is baffling.
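
        For anyone who has not hit that missing-index problem: PostgreSQL does not automatically index the referencing column, so cascaded updates and deletes have to scan the child table for matching rows. A sketch using the tables from this post:

        -- Index the referencing column so cascaded updates or deletes
        -- against test_fk can use an index scan on the child table.
        CREATE INDEX ON test_fk_ref_1 (test_fk_id);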

