PG Phriday: Instant Runoff Through SQL

The United States held an election recently, and there has been some … mild controversy regarding the results. Many raised issues about this before the election itself, but what if we had used instant-runoff voting instead? More importantly, can we implement it with Postgres?

Well, the answer to the last question is a strong affirmative. So long as we don’t break the results down into voting districts, and make wild unsupported assumptions regarding rankings, that is. But what’s a little inaccuracy in a demo?

Let’s start with the base tables:

CREATE TABLE irv_election
(
  election_id    SERIAL PRIMARY KEY,
  election_name  VARCHAR NOT NULL
);

CREATE INDEX idx_election_election_name
    ON irv_election (election_name);

CREATE TABLE irv_candidate
(
  candidate_id    SERIAL PRIMARY KEY,
  election_id     INT NOT NULL REFERENCES irv_election,
  candidate_name  VARCHAR NOT NULL
);

CREATE INDEX idx_candidate_candidate_name
    ON irv_candidate (candidate_name);

CREATE TABLE irv_vote
(
  vote_id         SERIAL PRIMARY KEY,
  election_id     INT NOT NULL REFERENCES irv_election,
  candidate_list  INT[] NOT NULL
);

CREATE INDEX idx_vote_election_id
    ON irv_vote (election_id);

CREATE TABLE irv_scratch
(
  candidate_name  VARCHAR,
  votes           BIGINT,
  percent         NUMERIC,
  my_round        INT
);

Technically we could use this voting engine for the entire election, but we’d need to register each voting district separately. Ain’t nobody got time for that. Instead, we’ll just assume the popular vote works for the whole thing. These are the latest election results for the four major candidates while this post was being written:

  • Donald Trump - 60,071,650
  • Hillary Clinton - 60,467,245
  • Gary Johnson - 4,123,062
  • Jill Stein - 1,237,116

Given that our lame tabulation system hasn’t integrated the Electoral College, it’s winner-take-all, and therefore Clinton is the next President! Never mind that the margin of error results in a statistical tie; math doesn’t matter. Yet we still need an API for data entry to simplify everything.

To do that, we need three functions. One to register the election, one for candidates, and one for incoming votes. Here’s what those might look like:

CREATE OR REPLACE FUNCTION start_election(
  election VARCHAR
)
RETURNS VOID AS
$$
  INSERT INTO irv_election (election_name)
  VALUES (election);
$$ LANGUAGE SQL;


CREATE OR REPLACE FUNCTION register_candidate(
  election VARCHAR,
  candidate VARCHAR
)
RETURNS VOID AS
$$
  INSERT INTO irv_candidate (election_id, candidate_name)
  SELECT election_id, candidate
    FROM irv_election
   WHERE election_name = election;
$$ LANGUAGE SQL;


CREATE OR REPLACE FUNCTION register_vote(
  election VARCHAR,
  candidates VARCHAR[]
)
RETURNS VOID AS
$$
  WITH ranked AS (
    SELECT candidate_name, row_number() OVER () AS rank
      FROM unnest(candidates) candidate_name
  ),
  conv AS (
    SELECT c.election_id, c.candidate_id, r.rank
      FROM irv_candidate c
      JOIN ranked r USING (candidate_name)
     ORDER BY rank
  )
  INSERT INTO irv_vote (election_id, candidate_list)
  SELECT election_id, array_agg(candidate_id)
    FROM conv
   GROUP BY 1;
$$ LANGUAGE SQL;

Fancy! The start_election and register_candidate functions are pretty obvious, but what on Earth is going on in the register_vote routine?

Since we decided to use an array for the candidate vote ranking, we need to decode the candidate names into their corresponding IDs while maintaining the original ordering. There may be a better way to do this, but we started with a CTE to rank the candidates, another to convert the IDs, and a final aggregate to recombine the translated values. We could have avoided all of this complexity by simply using the text provided by the voter, but that’s too easy.

With the standard API out of the way, we still need to tabulate the votes themselves. IRV is a simple concept, and pretty easy to implement. It’s just a bit ugly:

CREATE OR REPLACE FUNCTION no_losers(orig INT[], remove INT[])
RETURNS INT[] AS
$$
DECLARE
  item INT; 
  ret_arr INT[] := orig;
BEGIN
  IF array_length(remove, 1) IS NULL THEN
    RETURN ret_arr;
  END IF;

  FOR item IN 1 .. array_length(remove, 1) LOOP
    ret_arr := ret_arr - remove[item];
  END LOOP;

  RETURN ret_arr;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE STRICT;


CREATE OR REPLACE FUNCTION tabulate_votes(
  election VARCHAR
)
RETURNS VOID AS
$$
DECLARE
  num_candidates INT;
  num_votes NUMERIC;
  vote_round INT;
  losers INT[] := '{}'::INT[]; -- Non-null empty array.
  winner VARCHAR;
  results RECORD;
BEGIN

  DROP TABLE IF EXISTS working_round;
  CREATE TABLE working_round (LIKE irv_scratch);

  -- Given the number of candidates in this election, we can
  -- reduce calculations.

  SELECT count(*) INTO num_candidates
    FROM irv_candidate c
    JOIN irv_election e USING (election_id)
   WHERE e.election_name = election;

  -- We need the total number of votes cast for display
  -- percentages.

  SELECT count(*) INTO num_votes
    FROM irv_vote v
    JOIN irv_election e USING (election_id)
   WHERE e.election_name = election;

  -- Record each round of the IRV process. At the end of each round,
  -- eliminate the worst performer and try again. Do this until we
  -- reach > 50% for one candidate.

  FOR vote_round IN 1 .. (num_candidates - 1) LOOP
    RAISE NOTICE '== Round % ==', vote_round;

    INSERT INTO working_round
    SELECT c.candidate_name, count(v.vote_id) AS votes,
           round(count(v.vote_id) / 
                 num_votes * 100.0, 2) AS percent,
           vote_round AS my_round
      FROM irv_vote v
      JOIN irv_candidate c ON (
             c.candidate_id = (no_losers(v.candidate_list, losers))[1]
           )
      JOIN irv_election e ON (e.election_id = v.election_id)
     WHERE e.election_name = election
     GROUP BY c.candidate_name;

    -- Print the results of the round for spot-checking.

    FOR results IN
        SELECT * FROM working_round
         WHERE my_round = vote_round
    LOOP
      RAISE NOTICE '%: % (%)', results.candidate_name,
        results.votes, results.percent;
    END LOOP;

    -- If this round has a winner, short-circuit so we can
    -- just report the results.

    SELECT candidate_name INTO winner
      FROM working_round
     WHERE percent > 50
       AND my_round = vote_round;

    IF winner IS NOT NULL THEN
      RAISE NOTICE 'Winner of % is %!', election, winner;
      EXIT;
    END IF;

    -- This is where we determine the loser of this round.
    -- It's just the lowest ranked result.

    SELECT array_append(losers, c.candidate_id) INTO losers
      FROM working_round w
      JOIN irv_candidate c USING (candidate_name)
     WHERE my_round = vote_round
     ORDER BY w.votes
     LIMIT 1;

  END LOOP;

END;
$$ LANGUAGE PLPGSQL;

The no_loser function is necessary due to a small issue with Postgres array handling. Primarily, we can’t subtract an array from another array. Imagine we maintain a list of losers for each subsequent IRV round. Totals would normally be a simple matter of obtaining the first element of the ranked candidates minus the list of losers. Even the intarray extension won’t help us, as it reorders the results, defeating the purpose of ranking them. Our function doesn’t have that problem, but it could probably be more efficient. We marked it as IMMUTABLE and STRICT as optimization flags, as static parameters will always produce the same results.

The tabulate_votes function is pretty hefty, but it does everything we need. This is where the “instant” in instant-runoff occurs. Since the candidates are pre-ranked, we can just use the data we already have in order to obtain a majority winner. Really we’re just running a plurality vote over and over again after slowly killing off the losers; it’s a macabre kind of mathematical evolution.

Now we just need some data. It’s probably possible to download the results from some website somewhere, but we’re already pretty inaccurate. It’s not possible to derive voter rank intent from that data in any case. Instead, we’ll just take the vote totals and reduce them by a factor of 1000 to make the election quick.

Regarding the rankings, we have to make a few decisions that are fairly coarse. For now, let’s just take a liberal versus conservative approach. To do that, we’ll assume Stein voters would choose Clinton as their second choice, and vice-versa. Trump and Johnson are the same story. This is a really polarized election, so neither group would debase themselves by humoring the other candidates.

SELECT start_election('Pres 2016');

SELECT register_candidate('Pres 2016', 'Clinton');
SELECT register_candidate('Pres 2016', 'Trump');
SELECT register_candidate('Pres 2016', 'Stein');
SELECT register_candidate('Pres 2016', 'Johnson');

SELECT register_vote('Pres 2016', '{Clinton, Stein}')
  FROM generate_series(1, 60467);

SELECT register_vote('Pres 2016', '{Stein, Clinton}')
  FROM generate_series(1, 1237);

SELECT register_vote('Pres 2016', '{Trump, Johnson}')
  FROM generate_series(1, 60071);

SELECT register_vote('Pres 2016', '{Johnson, Trump}')
  FROM generate_series(1, 4123);

SELECT tabulate_votes('Pres 2016');

NOTICE:  == Round 1 ==
NOTICE:  Clinton: 60467 (48.03)
NOTICE:  Stein: 1237 (0.98)
NOTICE:  Trump: 60071 (47.71)
NOTICE:  Johnson: 4123 (3.27)
NOTICE:  == Round 2 ==
NOTICE:  Clinton: 61704 (49.01)
NOTICE:  Trump: 60071 (47.71)
NOTICE:  Johnson: 4123 (3.27)
NOTICE:  == Round 3 ==
NOTICE:  Clinton: 61704 (49.01)
NOTICE:  Trump: 64194 (50.99)
NOTICE:  Winner of Pres 2016 is Trump!

And there you have it. Using completely invalid assumptions for a voting system we will never implement, Trump is the winner of the popular vote!

The important lesson in all of this is partially of scale. If we really had entered all of the data for every district into this system, we would have over 125-million rows. We could transfer all of that data to an external reporting system for tabulation, or we can just handle that process locally. The caller of tabulate_votes can check the working_round table to directly observe the results of each round and handle orders of magnitude less data.

We also used a pretty crazy range of techniques here. CTEs, window functions, functional hints, array manipulation, debug notices, and temporary tables are just a tiny fragment of what’s really available.

In the end we can all agree that the true winner of the election was Postgres.