PG Phriday: Database Creation Workshop

Postgres theory, feature discussion, and advocacy are fun. But even I’ll admit it’s nice to have some practical application every once in a while. This week, we’re going to build an actual database.

But what would be small enough for a proof of concept, yet somewhat interesting? Well, I’m a fan of Hearthstone. It’s a silly card game much like Magic: The Gathering, but has the distinct aura of “eSports!” Regardless, it’s a fun little time waster, and has a few hundred data points we can manipulate.

Annoyingly, there’s no official card list. But fans are nothing if not dedicated, and several folks maintain spreadsheets for their own purposes and often share them. This one had a pretty good breakdown and is up to date with the latest expansions, so it makes a good starting point.

The second page of this Excel workbook is a card list, which I saved as a quoted tab-delimited file. The first three lines are garbage, so away they went. The last four lines are also empty, so into the digital bucket in the sky they passed. What’s left is something that we can import into a junk table:

CREATE UNLOGGED TABLE hs_raw_import (
  junk1 VARCHAR,
  junk2 VARCHAR,
  junk3 VARCHAR,
  junk4 VARCHAR,
  expansion VARCHAR,
  card_name VARCHAR,
  character_class VARCHAR,
  rarity VARCHAR,
  category VARCHAR,
  tribe VARCHAR,
  mana_cost VARCHAR,
  attack VARCHAR,
  hitpoints VARCHAR,
  card_text VARCHAR,
  junk5 VARCHAR,
  junk6 VARCHAR,
  junk7 VARCHAR
);

COPY hs_raw_import FROM '/tmp/hearthstone_cards.tsv'
  WITH (FORMAT csv, DELIMITER E'\t', HEADER);

ANALYZE hs_raw_import;

All of the “junk” columns are an unfortunate reality of the COPY command: there’s no way to ignore columns in the source file. That’s OK in this case, because we’re going to discard the import table when everything is done, and COPY is certainly speedy; importing all 743 rows on a test VM required less than 5ms.
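Incidentally, if the placeholder columns really rankle, one workaround available since Postgres 9.3 is COPY FROM PROGRAM, which lets an external command trim the file before Postgres sees it. Here’s a sketch using a hypothetical hs_cards_trimmed table; note that it requires superuser-level privileges, and the naive cut invocation assumes no quoted field contains an embedded tab:

-- Hypothetical import table with only the ten meaningful columns.
CREATE UNLOGGED TABLE hs_cards_trimmed (
  expansion        VARCHAR,
  card_name        VARCHAR,
  character_class  VARCHAR,
  rarity           VARCHAR,
  category         VARCHAR,
  tribe            VARCHAR,
  mana_cost        VARCHAR,
  attack           VARCHAR,
  hitpoints        VARCHAR,
  card_text        VARCHAR
);

-- cut -f5-14 keeps only fields 5 through 14, dropping the junk columns.
COPY hs_cards_trimmed FROM PROGRAM 'cut -f5-14 /tmp/hearthstone_cards.tsv'
  WITH (FORMAT csv, DELIMITER E'\t', HEADER);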

Now we need to transform the data into something a bit more normalized, and to do that, we need to understand its shape. A technique that works really well is to examine how values are distributed by checking the contents of the pg_stats view:

SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename='hs_raw_import'
   AND attname NOT LIKE 'junk%';

     attname     | n_distinct 
-----------------+------------
 expansion       |          9
 card_name       |         -1
 hitpoints       |         12
 card_text       |   -0.89502
 character_class |         10
 rarity          |          5
 category        |          3
 tribe           |          7
 mana_cost       |         13
 attack          |         12

In these statistics, a negative n_distinct means the number of distinct values scales with the total row count: the figure is the negated fraction of rows expected to be distinct, so the -1 for card_name means every card name is unique. Positive numbers are absolute counts, and those are the columns we want to focus on for normalization.
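To turn those proportional figures into approximate counts, we can multiply by the planner’s row estimate from pg_class. A quick sketch, ignoring schema qualification for brevity:

-- Convert negative (proportional) n_distinct values into estimated counts.
SELECT s.attname,
       CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
            ELSE -s.n_distinct * c.reltuples
       END AS approx_distinct
  FROM pg_stats s
  JOIN pg_class c ON (c.relname = s.tablename)
 WHERE s.tablename = 'hs_raw_import'
   AND s.attname NOT LIKE 'junk%';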

After leveraging a bit of knowledge regarding the actual game mechanics, we can ignore the columns that act as measurable metrics. Things like hitpoints, attack, and mana_cost may not have many distinct values, but they carry no further associated information. But what about the rest? Let’s dig a little deeper:

  • expansion : As with most CCGs, new cards and mechanics are added through expansions. These have their own associated attributes we might want to track independently. This one warrants a new table.
  • character_class : In Hearthstone, cards are separated into nine distinct classes, plus a tenth “Neutral” group of cards any class can use. Again, if we were writing an app, we might list several other data points about each class, so we’d want that in a separate table.
  • rarity : This one is a bit tricky. There are currently only five rarity levels, and each has an occurrence percentage when opening new card packs, which might normally justify a lookup table. But Blizzard is unlikely to add further rarity levels, so for now, let’s leave the percentages as an academic exercise and keep rarity as a plain text description. (A quick spot-check after this list confirms the current values.)
  • category : Currently there are only minions, spells, and weapons. This is extremely likely to change in the future, assuming the game remains popular for several more decades, as Magic has. This one gets a new table.
  • tribe : Some players consider this a subcategory. Minions can sometimes be further broken down into a taxonomy that will alter game mechanics via synergistic effects. We definitely want to track this separately.
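Before committing to these decisions, it’s worth verifying the candidate values directly against the raw import, in case any stray entries snuck in:

-- Spot-check the candidate lookup columns in the raw import.
SELECT DISTINCT rarity FROM hs_raw_import ORDER BY 1;
SELECT DISTINCT category FROM hs_raw_import ORDER BY 1;
SELECT DISTINCT tribe FROM hs_raw_import ORDER BY 1;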

Not listed here are attributes inherent in the card text itself. The taunt attribute, for example, is a significant game mechanic we’ll want to mine from the text. Since card text is often conditional, these optional elements should also be tracked separately through an attribute table.
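As a quick preview of that mining, a case-insensitive pattern match against the raw import gives a decent first cut; we’ll formalize the same idea with an attribute table below:

-- Rough first pass: which cards mention the taunt mechanic?
SELECT card_name, card_text
  FROM hs_raw_import
 WHERE card_text ~* 'taunt';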

Given all of this, here’s an extremely simplified architecture for representing the cards themselves, with ‘etc’ representing further descriptive columns. Queries to bootstrap them from the source table are at the end:

CREATE TABLE expansion (
  expansion_id    SERIAL PRIMARY KEY,
  expansion_name  VARCHAR NOT NULL,
  release_date    DATE NULL,
  etc             VARCHAR NULL
);

CREATE TABLE character_class (
  class_id       SERIAL PRIMARY KEY,
  class_name     VARCHAR NOT NULL,
  hero_power     VARCHAR NULL,
  etc            VARCHAR NULL
);

CREATE TABLE category (
  category_id    SERIAL PRIMARY KEY,
  category_name  VARCHAR NOT NULL,
  etc            VARCHAR NULL
);

CREATE TABLE minion_tribe (
  tribe_id    SERIAL PRIMARY KEY,
  tribe_name  VARCHAR NOT NULL,
  etc         VARCHAR NULL
);

CREATE TABLE attribute (
  attribute_id  SERIAL PRIMARY KEY,
  att_name      VARCHAR NOT NULL,
  effect        TEXT NULL,
  etc           VARCHAR NULL
);

CREATE TABLE card (
  card_id       SERIAL PRIMARY KEY,
  expansion_id  INT NOT NULL REFERENCES expansion,
  card_name     VARCHAR NOT NULL,
  class_id      INT NOT NULL REFERENCES character_class,
  rarity        VARCHAR NOT NULL,
  category_id   INT NOT NULL REFERENCES category,
  tribe_id      INT NULL REFERENCES minion_tribe,
  mana_cost     INT NOT NULL,
  attack        INT NULL,
  hitpoints     INT NULL,
  card_text     TEXT
);

CREATE INDEX idx_card_expansion ON card (expansion_id);
CREATE INDEX idx_card_class ON card (class_id);
CREATE INDEX idx_card_category ON card (category_id);
CREATE INDEX idx_card_tribe ON card (tribe_id);

CREATE TABLE card_attribute (
  card_id       INT NOT NULL REFERENCES card,
  attribute_id  INT NOT NULL REFERENCES attribute
);

CREATE INDEX idx_card_attribute_attribute_id
    ON card_attribute (attribute_id);

INSERT INTO expansion (expansion_name)
SELECT DISTINCT expansion FROM hs_raw_import;

INSERT INTO character_class (class_name)
SELECT DISTINCT character_class FROM hs_raw_import;

INSERT INTO category (category_name)
SELECT DISTINCT category FROM hs_raw_import;

INSERT INTO minion_tribe (tribe_name)
SELECT DISTINCT tribe FROM hs_raw_import
 WHERE tribe IS NOT NULL;

INSERT INTO attribute (att_name) VALUES ('taunt'), ('enrage');

INSERT INTO card (
    expansion_id, card_name, class_id, rarity, category_id,
    tribe_id, mana_cost, attack, hitpoints, card_text
)
SELECT e.expansion_id, r.card_name, cl.class_id, r.rarity,
       ca.category_id, t.tribe_id, r.mana_cost::INT,
       r.attack::INT, r.hitpoints::INT, r.card_text
  FROM hs_raw_import r
  JOIN expansion e ON (e.expansion_name = r.expansion)
  JOIN character_class cl ON (cl.class_name = r.character_class)
  JOIN category ca ON (ca.category_name = r.category)
  LEFT JOIN minion_tribe t ON (t.tribe_name = r.tribe);

INSERT INTO card_attribute (card_id, attribute_id)
SELECT c.card_id, a.attribute_id
  FROM card c
  JOIN attribute a ON (c.card_text ~* a.att_name);

ANALYZE attribute;
ANALYZE card_attribute;
ANALYZE expansion;
ANALYZE character_class;
ANALYZE category;
ANALYZE minion_tribe;
ANALYZE card;

Whew! That was a lot of work! But in the end, we have a constellation of tables that accurately represents how cards work, with the future potential of adding more mechanics without reorganizing our architecture. There are obviously more potential attributes than taunt and enrage, but we’ll leave the data mining as an exercise for anyone interested enough to really flesh out the model.
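As a sketch of how cheap that extension is, here’s how we might register charge, another keyword that appears in card text, without touching the schema at all:

-- New mechanics are new rows, not new columns or tables.
INSERT INTO attribute (att_name) VALUES ('charge');

INSERT INTO card_attribute (card_id, attribute_id)
SELECT c.card_id, a.attribute_id
  FROM card c
  JOIN attribute a ON (c.card_text ~* a.att_name)
 WHERE a.att_name = 'charge';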

Postgres offers us a nice way to leverage this data, too. Having to decode all of these IDs is pretty annoying, so we probably want a view. Further, cards don’t change very often, and new expansions are only released every few months. This makes them a great candidate for a materialized view. Here’s how we’d do that:

CREATE MATERIALIZED VIEW v_all_cards AS
SELECT e.expansion_name AS expansion, c.card_name,
       cl.class_name AS class, c.rarity,
       ca.category_name AS category,
       t.tribe_name AS tribe, c.mana_cost,
       c.attack, c.hitpoints, c.card_text,
       ARRAY(
         SELECT a.att_name
           FROM card_attribute tr
           JOIN attribute a USING (attribute_id)
          WHERE tr.card_id = c.card_id
       ) AS attributes
  FROM card c
  JOIN expansion e USING (expansion_id)
  JOIN character_class cl USING (class_id)
  JOIN category ca USING (category_id)
  LEFT JOIN minion_tribe t USING (tribe_id);

CREATE INDEX idx_all_expansion ON v_all_cards (expansion);
CREATE INDEX idx_all_class ON v_all_cards (class);
CREATE INDEX idx_all_category ON v_all_cards (category);
CREATE INDEX idx_all_tribe ON v_all_cards (tribe);
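Since the underlying tables only change when an expansion or balance patch lands, keeping the view current is a single, infrequent command. (If we added a unique index, say on card_name since every name is distinct, REFRESH ... CONCURRENTLY would avoid locking readers.)

-- After loading a new expansion, rebuild the view contents in one step.
REFRESH MATERIALIZED VIEW v_all_cards;

With that in place, queries against the view are straightforward: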

SELECT card_name, class, rarity
  FROM v_all_cards
 WHERE expansion = 'Classic'
   AND 'enrage' = ANY(attributes);

      card_name      |  class  |  rarity   
---------------------+---------+-----------
 Angry Chicken       | Neutral | Rare
 Amani Berserker     | Neutral | Common
 Raging Worgen       | Neutral | Common
 Tauren Warrior      | Neutral | Common
 Spiteful Smith      | Neutral | Common
 Grommash Hellscream | Warrior | Legendary

And of course, this particular rabbit hole can go much, much deeper. With only a few hundred rows, it may seem silly to design the database this way, but slap a front-end on this, and it’s a collection management tool that rivals the one built into the actual game client. The official client only allows players to search their collection by expansion, mana cost, and text, while our structure has far more flexibility.
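For instance, here’s a search the official client can’t express, combining tribe, cost, and a mined attribute. (The 'Murloc' spelling assumes that’s how the tribe appears in the source spreadsheet.)

-- Cross-cutting search: cheap Murlocs that also have taunt.
SELECT card_name, mana_cost, attack, hitpoints
  FROM v_all_cards
 WHERE tribe = 'Murloc'
   AND mana_cost <= 3
   AND 'taunt' = ANY(attributes);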

This is just one example of using Postgres in an everyday scenario. There’s much more out there if you’re interested in looking. Happy hunting!