PG Phriday: Design Pattern Workshop

Friday March 31, 2017

Recently on the pgsql-performance mailing list, a question popped up regarding Postgres RAM usage. In this instance Pietro wondered why Postgres wasn’t using more RAM, and why his process was taking so long. There were a few insightful replies, and they’re each interesting for reasons that aren’t immediately obvious. Let’s see what is really going on here, and perhaps answer a question while we’re at it.

Pietro presents several postgresql.conf settings, but here are the ones that matter:

shared_buffers = 24GB
effective_cache_size = 72GB
work_mem = 512MB
maintenance_work_mem = 1GB

The configured system has 96GB of total RAM. Keeping that in consideration, we see that shared_buffers is set to 24GB, which is a quarter of the server’s available memory. The only potential issue here is that it’s possible a checkpoint could be a very painful experience if a significant portion of that memory becomes dirty. Since there are no complaints of high IO waits, let’s just assume that isn’t happening.

Common advice suggests setting effective_cache_size to an amount equivalent to half of system RAM + shared_buffers. A setting of 72GB looks high, but Postgres does not allocate any of that on its own. This parameter merely influences the query planner, since higher values imply an increased probability a disk page is in either Postgres or OS caches. This is an optimization setting, not a memory allocation.

A work_mem value of 512MB is extremely high. In some cases however, that’s perfectly acceptable. The amount specified here will be used for all internal query execution operations that require a memory allocation. This means sorts, hashes, bitmaps, materializations, and so on. Thus a single query can instantiate multiple 512MB segments for each necessary operation. Effectively the user has greatly reduced the likelihood Postgres will utilize disk storage for any query operations.

And finally we have maintenance_work_mem. This is so often set to 1GB, it might as well be the default. This is the amount of RAM Postgres uses when doing things like creating indexes, vacuuming, and anything else that might need vast amounts of working space to proceed. We don’t know Pietro’s procedures, but we can make some inferences later. This setting really only matters if some part of his data onboarding includes adding new indexes or triggers an autovacuum.

Then we see how memory is performing once Postgres is running:

              total        used        free      shared  buff/cache   available
Mem:            94G         28G         46G         17M         19G         64G
Swap:           15G          0B         15G
Total:         109G         28G         61G

That’s not a lot of usage. Of that 28GB, we know Postgres has 24 simply due to the setting for shared_buffers. This means that the actual queries are only consuming 4GB at most.

This is particularly relevant because if we skip to the end of the post, we see this output from top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9686 postgres  20   0 24.918g 214236  12628 R 100.0  0.2   2872:38 postgres
 9687 postgres  20   0 24.918g 214212  12600 R 100.0  0.2   2872:27 postgres
 9688 postgres  20   0 25.391g 709936  12708 R 100.0  0.7   2872:40 postgres
 9691 postgres  20   0 24.918g 214516  12900 R 100.0  0.2   2865:23 postgres
 9697 postgres  20   0 24.918g 214284  12676 R 100.0  0.2   2866:05 postgres
 ...

All 16 of the backends are pegged at 100% CPU, and have been for almost 48 hours each (2872/60). That definitely corroborates Pietro’s claim that this workload has been running for 48 hours. This suggests little to no waiting for disk IO during the entire process.

So will more RAM help? Probably not. Pietro says he’s doing “time series expansion and some arithmetics”. This kind of procedure doesn’t generally require vast amounts of RAM, as calculations don’t often invoke sorting or other temporary memory structures. So what’s really going on?

The second reply gets at the real heart of the matter immediately:

You’d be better off to ask for help in optimizing your queries IMHO.

Why would Scott say that?

Well, there are a lot of ways to do the same thing. In the database world, some of these are highly preferable to others. How exactly is Pietro expanding those time series, for example? He could be doing something like this:

DO $$
DECLARE
  a INT;
  b INT;
  ...
  new_a INT;
  new_b INT;
  ...
BEGIN
  FOR a, b, ... IN SELECT col_a, col_b, ...
  LOOP
    new_a = some_calculation(a, b, c, ..., n);
    new_b = other_calculation(a, b, c, ..., n);

    INSERT INTO target_table VALUES (new_a, new_b, ...);
  END LOOP;
END;
$$ LANGUAGE plpgsql;

If the necessary logic is complicated enough, that might be the right approach. However, it is extremely expensive to loop through each individual row from a driver query, allocate variables, perform calculations, and invoke a single insert. This kind of process would peg a CPU at 100% while still being highly inefficient by multiple orders of magnitude.

There’s also the possibility Pietro knows this, and the calculations were fairly simple. So maybe he decided the process could be inlined a bit more. Instead of that anonymous block (or something equivalent on the application side), he decides to write helper functions to do the calculations instead:

CREATE OR REPLACE FUNCTION some_calculation(a INT, b INT, ...)
RETURNS INT AS
$$
DECLARE
  retval INT;
BEGIN
  -- Some calculations here

  RETURN retval;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;

Being a smart guy, Pietro knows immutable function results are cached much better than otherwise, and strict functions won’t even run if a parameter is null. With these two declarations, if any invocations take the same parameters, or a parameter is empty, Postgres may not even execute the function at all. Avoiding a jump operation is a huge optimization all by itself, so consider the benefits of preventing the overhead of a function call and its associated logic.

By defining the calculations in this manner, his process would look like this instead:

INSERT INTO target_table
SELECT some_calculation(col_a, col_b, ...),
       other_calculation(col_d, col_e, ...),
       ...
  FROM ...
 WHERE ...;

Or maybe Pietro knows the value of data staging. Caching is great, but in sufficiently complicated examples of business logic, transforming it into immutable functions may be somewhat difficult. In addition, some “calculations” may rely on the results of another query. This is where temporary and unlogged tables, or CTEs come in.

Consider the fact databases natively communicate using Set Theory. They are highly optimized, and specifically engineered to bulk-process rows in ways user-generated applications simply can’t match in all but the edgiest of cases. So the final iteration of the data onboarding might resemble this:

CREATE TEMP TABLE expand_series AS
SELECT ...;

CREATE TEMP TABLE relate_series AS
SELECT ...;

CREATE TEMP TABLE basic_tally AS
SELECT ...
  FROM expand_series
  JOIN relate_series ON (...);

INSERT INTO target_table
SELECT ...
  FROM basic_tally
  JOIN ...
  ...
 WHERE ...;

Or maybe Pietro likes CTEs more, so uses those instead of multiple temporary table steps. Postgres will fully materialize everything anyway. The primary benefit to temporary tables is that repeated execution of multiple steps won’t require re-running all previous elements. If this were some kind of one-time report, a CTE would arguably make more sense as it operates atomically.

Regardless, this kind of iterative transformation allows us to fully embrace Set Theory. Don’t operate on one row, but all of them. If we can organize calculations into distinct categories or steps, then we’re actually requesting the database to transform the data as a whole for each operation. For this entire data set, add some derived date column. For this entire data set, produce a standard deviation variant along this window. For this entire data set, derive these additional columns from these formulae.

That’s how databases think, and Postgres is no different in this regard. By looping through each individual result, we’re actively disrupting that process and literally making it hundreds or thousands of times slower. We could have all the memory in the world, and even find some way to fully utilize it, and it wouldn’t influence that universal truth.

Now, we don’t know what Pietro is actually doing, just that he’s convinced more RAM might help. The evidence he provided suggests that isn’t the case. Having sixteen processes running at full blast for two days either means he has a massive amount of data, or there’s some unaddressed inefficiency in his overall approach.

I suspect the latter only because it’s so common. People don’t usually think like databases, and really, who can blame them? What kind of nutcase would willingly do this for a living?

See Also