PG Phriday: Functions and Performance Attributes

Users both old and new are likely aware that Postgres has functions. Some lucky readers may have even written one or two. For those without much experience in this area, or who are thinking about contributing a few functions to a database for the first time, there are a couple of things you should know. This week, I want to cover a few of the function declaration decorators. If you have no idea what I’m talking about, this is probably something you’ll want to read regardless of your experience level.

Let’s begin with a very basic function that, as usual, does nothing:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    RETURN myvar;
END;
$$ LANGUAGE plpgsql;

\timing on

SELECT f_do_nothing(1)
  FROM generate_series(1, 1000000) AS s (id);

Now, hopefully we can agree that a function which takes one parameter, returns it unchanged, and does no other work is pretty simple. Calling this function one million times in a test VM required about two seconds. Some astute readers might point out that I could have used the SQL language instead of PL/pgSQL, and they’d be absolutely right. Doing so would reduce the run time to about 600ms for this example.
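
For the curious, that SQL variant might look something like this (the function name here is just my own invention, to keep both versions around):

CREATE OR REPLACE FUNCTION f_do_nothing_sql(myvar INT)
RETURNS INT AS
$$
    -- In versions before 9.2, the parameter must be referenced as $1.
    SELECT myvar;
$$ LANGUAGE sql;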

Fortunately, the decorators I’ll be discussing are more powerful than even switching to a much simpler and consequently less versatile language. As of PostgreSQL 9.4, there are four that will actually impact performance: VOLATILE, STABLE, STRICT, and IMMUTABLE.

The VOLATILE hint isn’t strictly necessary, as it’s the default. Effectively it means the function can’t be optimized at all, as its effects can vary wildly based on table contents and parameters, even during execution. This forces PostgreSQL to invoke the function each time it’s used, regardless of what the function is actually doing. For our function, that’s clearly a detrimental approach, though some functions genuinely need it. Here’s a hypothetical example of a function that earns its VOLATILE label:
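
CREATE OR REPLACE FUNCTION f_roll_die()
RETURNS INT AS
$$
BEGIN
    -- random() produces a different value on every call, so no
    -- result can ever be safely cached or reused.
    RETURN ceil(random() * 6);
END;
$$ LANGUAGE plpgsql VOLATILE;

With that contrast in mind, what other hints do we have available?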

The STABLE attribute is a bit more stringent. The implication in this case is that the function cannot modify the contents of the database in any way. Within the context of a single statement, this means the function can be semi-optimized by preventing extra calls, because the underlying data should not change. For this particular example, the STABLE keyword reduced the average run time from about two seconds to roughly 1.7. Here’s what that looks like:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    RETURN myvar;
END;
$$ LANGUAGE plpgsql STABLE;

\timing on

SELECT f_do_nothing(1)
  FROM generate_series(1, 1000000) AS s (id);

After this comes the IMMUTABLE keyword, which takes the optimization one step further. This tells the planner that the function will always return the same result for the same parameters. Due to that guarantee, function calls can be silently replaced by the cached results where that benefits the planner. One way to look at this is by examining the performance:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    RETURN myvar;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

\timing on

SELECT f_do_nothing(1)
  FROM generate_series(1, 1000000) AS s (id);

The IMMUTABLE decorator reduced run time to 400ms in this case, but we don’t really know how many times the function was actually called. Just because a function call can be replaced by its result doesn’t mean that will happen. Let’s modify the situation slightly by adding an explicit pause in the code:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    PERFORM pg_sleep(1);
    RETURN myvar;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

\timing on

SELECT f_do_nothing(1), f_do_nothing(1);

If your system is like mine, this example ran for two seconds, despite the fact that PostgreSQL could have replaced the second call with the result of the first. This seems counter-intuitive, but the real power of the IMMUTABLE keyword is not simply in call replacement, but in the guarantee of parameter/result equivalence. It’s a fundamental misunderstanding that the database must prevent excess function calls; that would be ideal, but it isn’t promised. Let’s modify the code again to make the situation more explicit:

CREATE OR REPLACE FUNCTION f_do_nothing_imm(myvar INT)
RETURNS INT AS
$$
BEGIN
    RAISE NOTICE 'IMMUTABLE RAN!';
    RETURN myvar;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

CREATE OR REPLACE FUNCTION f_do_nothing_vol(myvar INT)
RETURNS INT AS
$$
BEGIN
    RAISE NOTICE 'VOLATILE RAN!';
    RETURN myvar;
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT 1
  FROM generate_series(1, 5) AS s (id)
 WHERE f_do_nothing_imm(10) = 10
   AND f_do_nothing_vol(10) = 10;

And the NOTICE output in this case is very telling:

NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  VOLATILE RAN!

While which calls get optimized in the SELECT list may seem somewhat arbitrary (recall that both of our sleeping calls ran), the WHERE portion was reduced appropriately: the immutable function ran only once, while the volatile version ran for every generated row. It all comes down to how the planner decides to simplify the execution plan of the query, and conditionals in the WHERE clause are much more likely to experience substitution.
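
One way to peek behind the curtain is with EXPLAIN. As a rough sketch, since immutable calls with constant arguments are folded away at planning time, the immutable comparison should vanish from the plan entirely, while the volatile call survives as a filter:

EXPLAIN SELECT 1
  FROM generate_series(1, 5) AS s (id)
 WHERE f_do_nothing_imm(10) = 10
   AND f_do_nothing_vol(10) = 10;

-- The immutable half folds to TRUE and disappears from the plan;
-- only f_do_nothing_vol(10) = 10 should remain in the Filter line.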

Finally comes the STRICT decorator. It goes one step further still and declares that the function will return NULL whenever any of its parameters are NULL. Again, this is an optimization where the planner can substitute the function call with NULL outright, thereby not only removing the function execution itself, but avoiding the requisite function parameter and return value mapping. Since STRICT is orthogonal to volatility, it can also be combined with the other keywords, like so:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    RETURN myvar;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

\timing on

SELECT f_do_nothing(NULL)
  FROM generate_series(1, 1000000) AS s (id);

This time, even though the function is clearly marked VOLATILE, it doesn’t execute even once. The run time is merely the time required to generate and return one million values.
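
To prove the function body truly never runs, we can borrow the notice trick from earlier. This is just a sketch, with a new function name of my own choosing:

CREATE OR REPLACE FUNCTION f_do_nothing_strict(myvar INT)
RETURNS INT AS
$$
BEGIN
    RAISE NOTICE 'STRICT RAN!';
    RETURN myvar;
END;
$$ LANGUAGE plpgsql STRICT;

SELECT f_do_nothing_strict(NULL);  -- No notice; the call is skipped entirely.
SELECT f_do_nothing_strict(1);     -- NOTICE:  STRICT RAN!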

Now, regardless of the power these keywords provide, there are some important caveats. When we used notices to view the call path of our functions, the planner did what we expected and prevented extra calls to the immutable version. So let’s try that example one more time with a variable in place, and examine those notices again:

SELECT 1
  FROM generate_series(1, 5) AS s (id)
 WHERE f_do_nothing_imm(s.id % 2) = s.id % 2
   AND f_do_nothing_vol(s.id % 2) = s.id % 2;

NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!

Doh! Though the immutable function should have only run twice, we see that it ran just as often as the volatile version. It would appear that the function declaration documentation clears this up (emphasis mine):

If this option is given, any call of the function with *all-constant* arguments can be immediately replaced with the function value.

The problem in this case is that the PostgreSQL planner treats variables as the worst-case scenario: any of an effectively unlimited number of values could be substituted there. Even though we know only two distinct values were passed to the function, PostgreSQL does not know that until execution time. Since the planner is what determines how IMMUTABLE affects execution, the keyword is effectively ignored for dynamic parameters.
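
EXPLAIN illustrates the difference. Unlike the constant-argument case, nothing here can be folded at plan time, so the full comparison should remain in the plan’s Filter line:

EXPLAIN SELECT 1
  FROM generate_series(1, 5) AS s (id)
 WHERE f_do_nothing_imm(s.id % 2) = s.id % 2;

-- With a non-constant argument, the planner leaves the comparison in
-- place, and the function runs once per row at execution time.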

With that said, the IMMUTABLE keyword exists beyond the context of optimization. In set theory, a function maps each input to exactly one output; likewise, for indexes to operate as expected, a function must always return the same value when given the same inputs. Since PostgreSQL supports functional indexes, this means only IMMUTABLE functions are candidates for providing indexable values.
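
As a quick sketch (the table and function here are purely hypothetical), an expression index demonstrates the requirement:

CREATE TABLE employee (full_name TEXT);

CREATE OR REPLACE FUNCTION f_normalize_name(name TEXT)
RETURNS TEXT AS
$$
BEGIN
    RETURN lower(trim(name));
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- This succeeds only because the function is IMMUTABLE. Declared STABLE
-- or VOLATILE, it would fail with:
--   ERROR:  functions in index expression must be marked IMMUTABLE
CREATE INDEX idx_employee_name_norm
    ON employee (f_normalize_name(full_name));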

This is also why the substitution process used by the planner seems arbitrary from an outside perspective; we’re not privy to the entire decision tree that generates the final execution plan. In many cases, either multiple function executions cannot be reduced, or the cost and complexity of doing so would be computationally prohibitive.

In the end, it comes down to opportunity cost. By using these decorators, we give PostgreSQL as much information as possible to make decisions that affect the query execution plan. Consequently, we not only stand to reduce query run time, but also increase the number of functions that are candidates for table indexing. This last part is especially important, because such indexes remove the overhead of executing the function against the indexed value at query time.
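
Continuing the hypothetical employee example, a query that repeats the index expression can read the precomputed values instead of calling the function for every row:

SELECT *
  FROM employee
 WHERE f_normalize_name(full_name) = 'pat smith';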

As the query planner improves over successive generations, these decorators only get more powerful. Try to use them where appropriate, and you can only improve the situation.