Recently I stumbled across a question on Reddit regarding the performance impact of using pgBadger on an active database server. The only real answer to this question is: do not use pgBadger. Before anyone asks—no, you shouldn’t use pgFouine either. This is not an indictment on the quality of either project, but a statement of their obsolescence in the face of recent PostgreSQL features.
One of the recommended
postgresql.conf changes for both of these tools is to set
log_min_duration_statement to 0. There are a few other changes they require to put the log entries in the correct format, but we can worry about those later. For now, let’s focus on a little innocent math.
One of the PostgreSQL servers I work with, processes almost two billion queries per day. Let’s imagine every such query is very simple, even though this is definitely not the case. Consider this an example query:
SELECT col1, col2 FROM my_table WHERE id=?
Assuming the query is paramterized, and the number is from one to a million, our average query length is 47 characters. Let’s just say it’s 50 to keep things easy. If we multiply that by two billion, that’s 100-billion bytes of logged SQL. Seen another way, that’s 93GB of logs per day, or about 1MB of log data per second.
In practice, such a short query will not constitute the bulk of a PostgreSQL server’s workload. In fact, if even a simple ORM is involved, all queries are likely to be far more verbose. Java’s hibernate in particular is especially prone to overly gratuitous aliases prepended to all result columns. This is what our query would look like after Hibernate was done with it:
SELECT opsmytable1_.col1, opsmytable1_.col1 FROM my_table opsmytable1_ WHERE opsmytable1_.id=?
If we ignore the whitespace I added for readability, and use values from one to a million, the average query length becomes 99. Remember, this is ignoring all useful data PostgreSQL would also be logging! There are also a number of other problems with many of my operating assumptions. It’s very unlikely that query traffic will be consistent, nor will the queries themselves be so short. In addition, I didn’t account for the length of the log prefix that should contain relevant metadata about the query and its duration.
Once on a boring day, I enabled all query logging just to see how verbose our logs became. On that fateful day, I set
log_min_duration_statement to 0 for approximately ten seconds, and the result was 140MB worth of log files. Thus was my curiosity sated, and my soul filled with abject horror. Faced with such untenable output, how can we do log analysis? There’s no way pgBadger can process 100GB of logs in a timely manner. I tried using it a while ago, and even that ten seconds of log output required over a minute of processing.
It turns out PostgreSQL has had an answer to this for a while, but it wasn’t until the release of 9.2 that the feature became mature enough to use regularly. The pg_stat_statements extension maintains a system catalog table that tracks query performance data in realtime. Constants and variables are replaced to generalize the results, and it exposes information such as the number of executions, the total average run time of all executions, the number of rows matched, and so on. This is more than any log processing utility can do given the most verbose settings available.
I could spend hours churning through log files, or I can execute a query like this:
SELECT calls, total_time/calls AS avg_time, query FROM pg_stat_statements ORDER BY 2 DESC LIMIT 10;
That query just returned the ten slowest queries in the database. I could easily modify this query to find the most frequently executed queries, and thus improve our caching layer to include that data. This module can be directly responsible for platform improvements if used properly, and the amount of overhead is minimal.
In addition, the log settings are still available in conjunction with
pg_stat_statements. I normally recommend setting
log_min_duration_statement to a value that’s high enough to remove log noise, but low enough that it exposes problems early. I have ours set to 1000 so any query that runs longer than one second is exposed. Even on a system as active as ours, this produces about 5MB of log entries per day. This is a much more reasonable amount of data for log analysis, spot-checking, or finding abnormal system behavior.
All of this said, we could just as easily watch the database cluster and set
log_min_duration_statement to a nonzero amount of milliseconds. For most systems, even 20 milliseconds would be enough to prevent log output from saturating our disk write queue. However, the
pg_stat_statements extension automatically takes care of performance statistics without any post-processing or corresponding increase in log verbosity, so why add pgBadger to the stack at all?
There may be a compelling argument I’m missing, but for now I suggest using
pg_stat_statements without PostgreSQL-focused log post-processing. Ops tools like Graylog or logstash are specifically designed to parse logs for monitoring significant events, and keeping the signal to noise ratio high is better for these tools.
Save logs for errors, warnings, and notices; PostgreSQL is great at keeping track of its own performance.