PG Phriday: Instant Runoff Through SQL
The United States held an election recently, and there has been some … mild controversy regarding the results. Many raised issues about this before the election itself, but what if we had used instant-runoff voting instead? More importantly, can we implement it with Postgres?
Well, the answer to the last question is a strong affirmative. So long as we don’t break the results down into voting districts, and make wild unsupported assumptions regarding rankings, that is. But what’s a little inaccuracy in a demo?
Let’s start with the base tables:
CREATE TABLE irv_election
(
election_id SERIAL PRIMARY KEY,
election_name VARCHAR NOT NULL
);
CREATE INDEX idx_election_election_name
ON irv_election (election_name);
CREATE TABLE irv_candidate
(
candidate_id SERIAL PRIMARY KEY,
election_id INT NOT NULL REFERENCES irv_election,
candidate_name VARCHAR NOT NULL
);
CREATE INDEX idx_candidate_candidate_name
ON irv_candidate (candidate_name);
CREATE TABLE irv_vote
(
vote_id SERIAL PRIMARY KEY,
election_id INT NOT NULL REFERENCES irv_election,
candidate_list INT[] NOT NULL
);
CREATE INDEX idx_vote_election_id
ON irv_vote (election_id);
CREATE TABLE irv_scratch
(
candidate_name VARCHAR,
votes BIGINT,
percent NUMERIC,
my_round INT
);
Technically we could use this voting engine for the entire election, but we’d need to register each voting district separately. Ain’t nobody got time for that. Instead, we’ll just assume the popular vote works for the whole thing. These are the latest election results for the four major candidates while this post was being written:
- Donald Trump - 60,071,650
- Hillary Clinton - 60,467,245
- Gary Johnson - 4,123,062
- Jill Stein - 1,237,116
Given that our lame tabulation system hasn’t integrated the Electoral College, it’s winner-take-all, and therefore Clinton is the next President! Never mind that the margin of error results in a statistical tie; math doesn’t matter. Yet we still need an API for data entry to simplify everything.
To do that, we need three functions. One to register the election, one for candidates, and one for incoming votes. Here’s what those might look like:
CREATE OR REPLACE FUNCTION start_election(
election VARCHAR
)
RETURNS VOID AS
$$
INSERT INTO irv_election (election_name)
VALUES (election);
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION register_candidate(
election VARCHAR,
candidate VARCHAR
)
RETURNS VOID AS
$$
INSERT INTO irv_candidate (election_id, candidate_name)
SELECT election_id, candidate
FROM irv_election
WHERE election_name = election;
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION register_vote(
election VARCHAR,
candidates VARCHAR[]
)
RETURNS VOID AS
$$
WITH ranked AS (
SELECT candidate_name, row_number() OVER () AS rank
FROM unnest(candidates) candidate_name
),
conv AS (
SELECT c.election_id, c.candidate_id, r.rank
FROM irv_candidate c
JOIN ranked r USING (candidate_name)
ORDER BY rank
)
INSERT INTO irv_vote (election_id, candidate_list)
SELECT election_id, array_agg(candidate_id)
FROM conv
GROUP BY 1;
$$ LANGUAGE SQL;
Fancy! The start_election
and register_candidate
functions are pretty obvious, but what on Earth is going on in the register_vote
routine?
Since we decided to use an array for the candidate vote ranking, we need to decode the candidate names into their corresponding IDs while maintaining the original ordering. There may be a better way to do this, but we started with a CTE to rank the candidates, another to convert the IDs, and a final aggregate to recombine the translated values. We could have avoided all of this complexity by simply using the text provided by the voter, but that’s too easy.
With the standard API out of the way, we still need to tabulate the votes themselves. IRV is a simple concept, and pretty easy to implement. It’s just a bit ugly:
CREATE OR REPLACE FUNCTION no_losers(orig INT[], remove INT[])
RETURNS INT[] AS
$$
DECLARE
item INT;
ret_arr INT[] := orig;
BEGIN
IF array_length(remove, 1) IS NULL THEN
RETURN ret_arr;
END IF;
FOR item IN 1 .. array_length(remove, 1) LOOP
ret_arr := ret_arr - remove[item];
END LOOP;
RETURN ret_arr;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION tabulate_votes(
election VARCHAR
)
RETURNS VOID AS
$$
DECLARE
num_candidates INT;
num_votes NUMERIC;
vote_round INT;
losers INT[] := '{}'::INT[]; -- Non-null empty array.
winner VARCHAR;
results RECORD;
BEGIN
DROP TABLE IF EXISTS working_round;
CREATE TABLE working_round (LIKE irv_scratch);
-- Given the number of candidates in this election, we can
-- reduce calculations.
SELECT count(*) INTO num_candidates
FROM irv_candidate c
JOIN irv_election e USING (election_id)
WHERE e.election_name = election;
-- We need the total number of votes cast for display
-- percentages.
SELECT count(*) INTO num_votes
FROM irv_vote v
JOIN irv_election e USING (election_id)
WHERE e.election_name = election;
-- Record each round of the IRV process. At the end of each round,
-- eliminate the worst performer and try again. Do this until we
-- reach > 50% for one candidate.
FOR vote_round IN 1 .. (num_candidates - 1) LOOP
RAISE NOTICE '== Round % ==', vote_round;
INSERT INTO working_round
SELECT c.candidate_name, count(v.vote_id) AS votes,
round(count(v.vote_id) /
num_votes * 100.0, 2) AS percent,
vote_round AS my_round
FROM irv_vote v
JOIN irv_candidate c ON (
c.candidate_id = (no_losers(v.candidate_list, losers))[1]
)
JOIN irv_election e ON (e.election_id = v.election_id)
WHERE e.election_name = election
GROUP BY c.candidate_name;
-- Print the results of the round for spot-checking.
FOR results IN
SELECT * FROM working_round
WHERE my_round = vote_round
LOOP
RAISE NOTICE '%: % (%)', results.candidate_name,
results.votes, results.percent;
END LOOP;
-- If this round has a winner, short-circuit so we can
-- just report the results.
SELECT candidate_name INTO winner
FROM working_round
WHERE percent > 50
AND my_round = vote_round;
IF winner IS NOT NULL THEN
RAISE NOTICE 'Winner of % is %!', election, winner;
EXIT;
END IF;
-- This is where we determine the loser of this round.
-- It's just the lowest ranked result.
SELECT array_append(losers, c.candidate_id) INTO losers
FROM working_round w
JOIN irv_candidate c USING (candidate_name)
WHERE my_round = vote_round
ORDER BY w.votes
LIMIT 1;
END LOOP;
END;
$$ LANGUAGE PLPGSQL;
The no_loser
function is necessary due to a small issue with Postgres array handling. Primarily, we can’t subtract an array from another array. Imagine we maintain a list of losers for each subsequent IRV round. Totals would normally be a simple matter of obtaining the first element of the ranked candidates minus the list of losers. Even the intarray extension won’t help us, as it reorders the results, defeating the purpose of ranking them. Our function doesn’t have that problem, but it could probably be more efficient. We marked it as IMMUTABLE
and STRICT
as optimization flags, as static parameters will always produce the same results.
The tabulate_votes
function is pretty hefty, but it does everything we need. This is where the “instant” in instant-runoff occurs. Since the candidates are pre-ranked, we can just use the data we already have in order to obtain a majority winner. Really we’re just running a plurality vote over and over again after slowly killing off the losers; it’s a macabre kind of mathematical evolution.
Now we just need some data. It’s probably possible to download the results from some website somewhere, but we’re already pretty inaccurate. It’s not possible to derive voter rank intent from that data in any case. Instead, we’ll just take the vote totals and reduce them by a factor of 1000 to make the election quick.
Regarding the rankings, we have to make a few decisions that are fairly coarse. For now, let’s just take a liberal versus conservative approach. To do that, we’ll assume Stein voters would choose Clinton as their second choice, and vice-versa. Trump and Johnson are the same story. This is a really polarized election, so neither group would debase themselves by humoring the other candidates.
SELECT start_election('Pres 2016');
SELECT register_candidate('Pres 2016', 'Clinton');
SELECT register_candidate('Pres 2016', 'Trump');
SELECT register_candidate('Pres 2016', 'Stein');
SELECT register_candidate('Pres 2016', 'Johnson');
SELECT register_vote('Pres 2016', '{Clinton, Stein}')
FROM generate_series(1, 60467);
SELECT register_vote('Pres 2016', '{Stein, Clinton}')
FROM generate_series(1, 1237);
SELECT register_vote('Pres 2016', '{Trump, Johnson}')
FROM generate_series(1, 60071);
SELECT register_vote('Pres 2016', '{Johnson, Trump}')
FROM generate_series(1, 4123);
SELECT tabulate_votes('Pres 2016');
NOTICE: == Round 1 ==
NOTICE: Clinton: 60467 (48.03)
NOTICE: Stein: 1237 (0.98)
NOTICE: Trump: 60071 (47.71)
NOTICE: Johnson: 4123 (3.27)
NOTICE: == Round 2 ==
NOTICE: Clinton: 61704 (49.01)
NOTICE: Trump: 60071 (47.71)
NOTICE: Johnson: 4123 (3.27)
NOTICE: == Round 3 ==
NOTICE: Clinton: 61704 (49.01)
NOTICE: Trump: 64194 (50.99)
NOTICE: Winner of Pres 2016 is Trump!
And there you have it. Using completely invalid assumptions for a voting system we will never implement, Trump is the winner of the popular vote!
The important lesson in all of this is partially of scale. If we really had entered all of the data for every district into this system, we would have over 125-million rows. We could transfer all of that data to an external reporting system for tabulation, or we can just handle that process locally. The caller of tabulate_votes
can check the working_round
table to directly observe the results of each round and handle orders of magnitude less data.
We also used a pretty crazy range of techniques here. CTEs, window functions, functional hints, array manipulation, debug notices, and temporary tables are just a tiny fragment of what’s really available.
In the end we can all agree that the true winner of the election was Postgres.