PG Phriday: Mining for Metadata
Every good database engine has a system catalog that describes the myriad of structures that model and preserve our data. Of course this is expected, as it would be somewhat silly for a database system not to use tables to represent its internal mechanisms. But that doesn’t mean they have to be humanly readable, or even make sense without a series of views or esoteric functions to decipher them. The information_schema standard serves a necessary role in that regard, and the Postgres implementation is extremely comprehensive. Yet the regular Postgres catalog is also fairly usable, so let’s explore and see what we find.
At first glance, there is a veritable smorgasbord of catalog tables and views to choose from. So, where to start? Let’s look at something simple and see what tables are available:
\x CREATE TABLE picard ( id SERIAL PRIMARY KEY, full_name TEXT NOT NULL, description TEXT ); INSERT INTO picard (full_name, description) SELECT a.id::TEXT, 'There are ' || a.id || ' lights!' FROM generate_series(1, 1000000) a(id); ANALYZE picard; SELECT * FROM pg_tables WHERE schemaname NOT IN ('information_schema', 'pg_catalog'); -[ RECORD 1 ]-------- schemaname | public tablename | picard tableowner | sthomas tablespace | hasindexes | t hasrules | f hastriggers | f
This is all pretty basic info, right? Note that we had to remove tables in the
pg_catalog schemas or we’d have to wade through over 50 other records. Just more proof that Postgres uses its own catalogs to… track its catalogs. What may not be so obvious is that
pg_tables is actually a view.
A DBA might find this obvious, given the concept of normalization. For the uninitiated however, database object names might be referred to multiple times through several catalog entries. To avoid using the same text value in multiple places and risking mismatches or modifying the name everywhere should it change, Postgres uses surrogate keys. Again, this approach is extremely common, but makes views a necessity to decode those proxy values.
pg_views and examine the definition of
pg_tables as well as itself:
SELECT definition FROM pg_views WHERE viewname = 'pg_tables'; definition ---------------------------------------------------------------- SELECT n.nspname AS schemaname, c.relname AS tablename, pg_get_userbyid(c.relowner) AS tableowner, t.spcname AS tablespace, c.relhasindex AS hasindexes, c.relhasrules AS hasrules, c.relhastriggers AS hastriggers FROM ((pg_class c LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace))) LEFT JOIN pg_tablespace t ON ((t.oid = c.reltablespace))) WHERE (c.relkind = 'r'::"char"); SELECT definition FROM pg_views WHERE viewname = 'pg_views'; definition -------------------------------------------------------------- SELECT n.nspname AS schemaname, c.relname AS viewname, pg_get_userbyid(c.relowner) AS viewowner, pg_get_viewdef(c.oid) AS definition FROM (pg_class c LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace))) WHERE (c.relkind = 'v'::"char");
What do we have, here? It appears as if both
pg_views are derived from information in
pg_class. As it turns out,
pg_class is probably the most important catalog table in the system. It’s essentially a reference for nearly every database object available. Since it’s in the middle of a vast constellation of other references, practically every column in the table is a key to another low-level catalog.
Why are views a
relkind of ‘v’, while tables are designated ‘r’? Well in Postgres, that ‘r’ stands for ‘relation’, as tables describe relationships across data. Probably. Yet the documentation for
pg_class describes all contents as relations, not just tables. Further, the column names in
pg_class don’t reflect a consistent association to the table name. Let’s look at
pg_namespace, which decodes schema (namespace) names, to see why that’s relevant:
\d pg_namespace Column | Type | Modifiers ----------+-----------+----------- nspname | name | not null nspowner | oid | not null nspacl | aclitem | Indexes: "pg_namespace_nspname_index" UNIQUE, btree (nspname) "pg_namespace_oid_index" UNIQUE, btree (oid)
pg_views utilize this table to decode
pg_class.relnamespace into a text value. Note the ’nsp’ prefix for all of the column names? That naming scheme is fairly consistent across most of the other catalog tables and views. The
pg_proc table that tracks functions (procedures) prefixes everything with ‘pro’, while
pg_index uses ‘ind’. From that perspective,
pg_class is a rather notable exception.
Though a bit perplexing and lost to the annals of history, one benefit of this abnormality is that it makes
pg_class easy to remember. It’s a short name, and since it’s used practically everywhere in the other catalog views, it’s almost synonymous with the catalog itself. Regardless of why,
pg_class is of the first places to look for nearly anything not covered by a ’nicer’ view. A fun example of this is exploiting the
SELECT relname AS tablename, relpages*8 AS kb_used, reltuples AS approx_rows FROM pg_class WHERE relname = 'picard'; tablename | kb_used | approx_rows -----------+---------+------------- picard | 66592 | 1e+06
These two are generally used by the query planner in various calculations to estimate query cost. For our purposes, they provide an excellent summary of the table’s overall impact on the database. We can use sorts to find the table with the most or least rows, the biggest table on disk, combine the two to seek out tables with abnormally large rows, and so on.
relpages column refers to the amount of database pages the table consumes. In Postgres, this defaults to 8KB, so in multiplying by 8, we can see how much physical space the table uses. In the vast majority of cases, this is more than sufficient and a quick shortcut when doing ad-hoc analysis of tables, but there are cases when an installation is compiled with a different default page size. So to be pedantic, we should repeat the same query using a couple provided decoding functions:
\x SELECT relname AS tablename, pg_relation_size(oid) AS bytes_used, pg_total_relation_size(oid) AS total_bytes, pg_size_pretty(pg_relation_size(oid)) AS pretty_size, pg_size_pretty(pg_total_relation_size(oid)) AS pretty_total, reltuples AS approx_rows FROM pg_class WHERE relname = 'picard'; -[ RECORD 1 ]+--------- tablename | picard bytes_used | 68190208 total_bytes | 90726400 pretty_size | 65 MB pretty_total | 87 MB approx_rows | 1e+06
To provide a lowest common denominator, the
pg_relation_size function always returns the size of the object in bytes. The
pg_size_pretty function just makes that value human readable by transforming it into the number of megabytes, gigabytes, or so on. The reason we used both in the query is to demonstrate the functionality along with the fact we don’t want to use human readable output in calculations.
But what about
pg_total_relation_size? This function is a neat trick that combines the table size along with any related content, like indexes, TOAST data, and so on. Normally the only way to get that data would be to join
pg_class with itself multiple times with
pg_index. That query is pretty unwieldy, so a function call is much easier to use. If we had omitted or changed our
WHERE clause, we could have divined size and row counts for whole swaths of tables in the database.
And what about
pg_index? Its job is to act as a swing table between a table and its indexes as described by
pg_class, as well as describe which columns are in each index, and other identifying information. But this particular catalog table is somewhat opaque, using multiple vector columns and trees to describe objects which can contain multiple other objects. It’s usually much easier to use
pg_indexes instead, which converts all of that into something human readable. For example:
\x SELECT * FROM pg_indexes WHERE tablename = 'picard'; -[ RECORD 1 ]---------------------------------------------------------- schemaname | public tablename | picard indexname | picard_pkey tablespace | indexdef | CREATE UNIQUE INDEX picard_pkey ON picard USING btree (id)
There isn’t very much information here, actually. But we do get to use the table name to obtain all indexes without any joins, and we can also see the index definition itself. For more complicated dives, like finding all tables that use a particular column in their indexes, especially if we also want to include composite indexes, we’d have to grit our teeth and use
pg_index instead. Though even then, it’s not so bad:
SELECT c.relname AS tablename, i.relname AS indexname FROM pg_index x JOIN pg_class c ON (c.oid = x.indrelid) JOIN pg_class i ON (i.oid = x.indexrelid) JOIN pg_attribute a ON ( a.attrelid = x.indrelid AND a.attnum = ANY (x.indkey) ) WHERE a.attname = 'id'; tablename | indexname -----------+------------- picard | picard_pkey
While this is hardly a novice-friendly example of multiple catalog tables, it does demonstrate that inquisitive data mining is possible. There’s a surprising amount of wealth stored in Postgres metadata if you’re brave enough to dig for it. Some masochistic DBAs have even constructed unholy amalgams of several catalog tables to derive potential index candidates, optimal column re-ordering for existing indexes, or even which indexes to remove entirely.
How is that even possible? Well, beyond the basic architecture metadata, Postgres collects statistics it stores in yet more tables and views. We’ll leave using these as an exercise for a future article, because they deserve to be explored in depth. Until then, explore the catalog and see what’s buried in its depths; you might be surprised what you find.