PG Phriday: CONFLICT of Interests

MySQL has had a REPLACE INTO syntax to perform “UPSERT” logic since practically the very beginning. For the longest time, users who wanted to switch to Postgres, but for whatever reason relied on this functionality, were essentially trapped. Postgres 9.5 changed all that, but why did it take so long? As with much of Postgres history, it’s a long story.

To really understand where Postgres started, we need to look at the “old” way of handling a row merge. Many in the database world have probably encountered this once or twice:

CREATE TABLE my_tab (a INT PRIMARY KEY, b TEXT);

CREATE FUNCTION merge_tab(k INT, v TEXT)
RETURNS VOID AS
$$
BEGIN
  LOOP
    -- First, try to update the key.

    UPDATE my_tab SET b = v
     WHERE a = k;

    IF FOUND THEN
      EXIT;
    END IF;

    -- The key doesn't exist, so try to insert it.

    BEGIN
      INSERT INTO my_tab (a, b) VALUES (k, v);
      EXIT;

    EXCEPTION WHEN unique_violation THEN
      -- Nothing here, allow the loop to continue.
    END;
  END LOOP;

END;
$$ LANGUAGE plpgsql;

SELECT merge_tab(1, 'James');
SELECT merge_tab(1, 'Jimmy');

What on Earth is all of that? Oddly enough, the somewhat convoluted logic is not only sound, it’s actually required to avoid a race condition. In the microseconds between attempting our UPDATE and falling through to the INSERT, some other transaction may have inserted the “missing” key. In that case, we’d encounter a unique constraint violation.

By catching the exception, we avoid being immediately kicked out of the function and are presented with a choice. Do we assume our value is “right” and repeat the loop to apply the update, or just exit silently under the assumption that the successful transaction that beat us is probably fine? This particular function chooses the former, because that’s what a merge or upsert tries to guarantee: that the requested action is applied. Were we to omit the loop, the exception block would still ensure there was no conflict or fatal error, but we could no longer rely on the function operating as advertised.

So why not invert the logic and remove the loop entirely? After all, we could just attempt the insert and, if it fails, perform the update within the exception block, right? Actually, no. Consider what happens if the target key is deleted by a concurrent transaction. Say we try our insert, and in the space of time between the key violation and our update, the row gets deleted. Suddenly our update matches nothing, silently affects zero rows, and the value we meant to merge never lands. That’s probably an extremely unlikely edge case, but in OLTP databases, the unlikely becomes frighteningly common. So to be safe, we’re stuck with the loop.
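
For the sake of argument, a rough sketch of that inverted approach might look something like this (the merge_tab_inverted name is purely illustrative):

CREATE FUNCTION merge_tab_inverted(k INT, v TEXT)
RETURNS VOID AS
$$
BEGIN
  -- Try the insert first...
  INSERT INTO my_tab (a, b) VALUES (k, v);

EXCEPTION WHEN unique_violation THEN
  -- ...and fall back to an update. If the conflicting row vanishes
  -- before this runs, the UPDATE matches zero rows and the merge is
  -- silently lost.
  UPDATE my_tab SET b = v
   WHERE a = k;
END;
$$ LANGUAGE plpgsql;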

Don’t think about it too much

That is a lot of overhead for what many consider basic functionality. Since that’s no longer a concern, let’s take a look at the actual syntax the Postgres team selected. To do that, let’s start with a very basic table with a handful of rows:

CREATE TABLE symbol_mapper (
  vendor_id    BIGINT   NOT NULL,
  ext_mapping  VARCHAR  NOT NULL,
  symbol       VARCHAR  NOT NULL,
  PRIMARY KEY (vendor_id, ext_mapping)
);

INSERT INTO symbol_mapper VALUES (1, 'Google', 'GOOGL');
INSERT INTO symbol_mapper VALUES (1, 'Apple', 'AAPL');
INSERT INTO symbol_mapper VALUES (2, 'goo', 'GOOGL');
INSERT INTO symbol_mapper VALUES (2, 'app', 'AAPL');

ANALYZE symbol_mapper;

The purpose of a mapping table is to decode external names or lookup values into our own internal ones. Since each vendor may have its own designation structure, we require a separate mapping for each. That also protects us in case two vendors use the same identifiers.

So far we have a fairly standard application of tables. Now let’s do something interesting:

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (2, 'app', 'AAPL')
    ON CONFLICT DO NOTHING;

INSERT 0 0

In this particular case, the “app” mapping already exists for vendor 2, so no insert takes place. This is the equivalent of not wrapping our insert/update with a loop. We don’t care what the value is, just as long as something is there. In reality, this is more of a way to suppress error output from violations than anything immediately useful.
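
For comparison, the same statement without the ON CONFLICT clause simply aborts with a duplicate key error along these lines (assuming the default name for our primary key constraint):

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (2, 'app', 'AAPL');

ERROR:  duplicate key value violates unique constraint "symbol_mapper_pkey"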

The real fun doesn’t start until we integrate the DO UPDATE functionality. Consider the case where we want to add Samsung as a mapping for a vendor. In this particular case, someone sneaked an existing row into the system, and it contains a typo.

INSERT INTO symbol_mapper VALUES (1, 'Samsung', 'SSLNF');

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Samsung', 'SSNLF')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SSNLF';

INSERT 0 1

SELECT * FROM symbol_mapper;

 vendor_id | ext_mapping | symbol 
-----------+-------------+--------
         1 | Google      | GOOGL
         1 | Apple       | AAPL
         2 | goo         | GOOGL
         2 | app         | AAPL
         1 | Samsung     | SSNLF

What we’ve done here is ensure the newest incoming mapping is the “correct” one; that’s our merge. An observant reader might ask how this is any different from our futile desire to attempt an INSERT with an UPDATE in an exception block. Unlike those two separate statements and the time-consuming exception handling, this is a single atomic action.

Did another session delete the row before us? We’ll just insert it again. Did another transaction delete the row we just inserted or updated? Oh well. The important part is that it is impossible to delete the row while our statement is running. So while the logic is similar to using an exception, the difference is that DO UPDATE is built into the database itself, so it can’t be broken into multiple actions that can be interrupted.

Another interesting bit of syntax is that we can actually incorporate a WHERE clause into the update beyond the implicit assumption that our update affects the same key we tried to insert.

INSERT INTO symbol_mapper VALUES (1, 'Sony', 'SONY');

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';

INSERT 0 1

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';

INSERT 0 0

In this case, we had a mapping for Sony that needed a correction. The first query affected the row we targeted, and the second did nothing. This is important because if we had not specified that predicate, both updates would have successfully modified the row. And so would all subsequent attempts. Remember Postgres keeps a row version for every update, even if the new and old values are identical. That’s just how MVCC works.
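
One quick way to watch this happen on a quiet test table is to peek at the ctid system column, which identifies the physical row version and changes every time a new version is written:

-- The physical row version changes with every UPDATE, even when the
-- new values are identical to the old ones.
SELECT ctid, symbol FROM symbol_mapper
 WHERE vendor_id = 1 AND ext_mapping = 'Sony';

UPDATE symbol_mapper SET symbol = 'SNY'
 WHERE vendor_id = 1 AND ext_mapping = 'Sony';

SELECT ctid, symbol FROM symbol_mapper
 WHERE vendor_id = 1 AND ext_mapping = 'Sony';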

In a loosely built application environment, it isn’t uncommon for several vectors to operate simultaneously. If a dozen of these each upsert the same value, they’ll all be satisfied that their work is complete, and Postgres would be stuck with a dozen duplicate old rows. VACUUM (and autovacuum) will ensure old row versions are recycled, but again, that’s more overhead we don’t need to invoke.

And of course, the WHERE clause isn’t restricted to deflecting repeated update attempts. There may be circumstances where we simply don’t want to apply changes. By specifying the table name, we can inspect any of the existing column values. What about the values we attempted to insert? Since those are the values that triggered the violation, they’re assigned to a record named “excluded”.

Here it is in action:

INSERT INTO symbol_mapper VALUES (1, 'Microsoft', 'msft');

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Microsoft', 'MSFT')
    ON CONFLICT ON CONSTRAINT symbol_mapper_pkey
    DO UPDATE
   SET symbol = 'MSFT'
 WHERE symbol_mapper.symbol != excluded.symbol;

This is a very similar situation to the one we had with Sony. The mapping for Microsoft needs an update if the existing value doesn’t match the one we’re attempting to insert. Well, we can perform that check explicitly without hard-coding those values into the query multiple times. It’s possible to refer to anything in the VALUES tuple by specifying “excluded”. Handy, eh?
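
For instance, the SET clause can pull its value straight from excluded rather than repeating the literal; here’s the same Microsoft correction written that way:

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Microsoft', 'MSFT')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = excluded.symbol
 WHERE symbol_mapper.symbol != excluded.symbol;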

Also notice that we’ve changed our conflict condition. Previously we had simply listed the columns in the primary key, and Postgres inferred the proper constraint from that definition. In this case, we directly stated the constraint that Postgres should use in resolving the conflict. It’s somewhat uncommon (and probably not entirely safe) to directly invoke constraint names, but the option is there in case we want it.

This feature was a long time coming; Postgres 9.5 was released in early 2016. As impossible as it sounds, we’ve only really had a little over a year to leverage ON CONFLICT. As a consequence, it’s still slowly seeping into existing Postgres application stacks as users incorporate it into their workflows. It’ll be a while before it’s taken for granted along with the rest of the kitchen sink Postgres offers.

Until then, it’s that special toy we’ve always wanted but couldn’t afford until now. There’s nothing else to do but make up for lost time!

“Toy”

P.S. Before I forget, Postgres Open is getting ready for 2017! If you or someone else wants to attend or even submit a talk, I highly recommend doing so. I’ve been something of a regular fixture there since it started in 2011, and I fully intend to try my hand again. It’s taking place in San Francisco this time around, so the change in venue will definitely be interesting. Apparently it’s all part of the new United States PostgreSQL Association, so it’s good seeing everything pull together behind a united entity. Here’s looking to the future!