1

I often use the WHERE clause random() > 0.5 to pick a random subset of my data. I noticed that when I use a set-returning function in a sub-query, I get either the whole set or nothing (meaning that the WHERE random() > 0.5 clause is evaluated before the set is generated), e.g.:

SELECT num 
FROM (
    SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num
) AS foo 
WHERE random() > 0.5;

This seems inconsistent because the following query does take the whole set into account:

SELECT num 
FROM (
    SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num
) AS foo 
WHERE random() > 0.1 * num;

Am I correct that this is inconsistent or does it make sense?

Notes:

  • I couldn't find another function to test with apart from random(), but there likely are some

  • I tested with generate_series as well

2 Comments
  • Can you run an EXPLAIN on the queries? It's probable that the execution plan is different, since it's optimal not to run random() for every row in the first query, but the planner must do so in the second. Commented Sep 21, 2016 at 18:49
  • The second statement involves a random value compared to every result of your subquery. The first one has nothing to do with your subquery. It is a single randomly generated number compared to 0.5. Commented Sep 21, 2016 at 18:56

3 Answers

3

In the first query, the expression in the WHERE clause is evaluated only once, because it does not depend on any column in the select list:

Result  (cost=0.01..0.51 rows=100 width=0) (actual time=0.017..0.021 rows=10 loops=1)
  One-Time Filter: (random() > '0.5'::double precision)
Planning time: 0.156 ms
Execution time: 0.058 ms

In the second case, the WHERE expression depends on a column, so it is evaluated for each row:

Subquery Scan on foo  (cost=0.00..2.76 rows=33 width=4) (actual time=0.052..0.083 rows=5 loops=1)
  Filter: (random() > ((0.1 * (foo.num)::numeric))::double precision)
  Rows Removed by Filter: 5
  ->  Result  (cost=0.00..0.51 rows=100 width=0) (actual time=0.017..0.022 rows=10 loops=1)
Planning time: 0.119 ms
Execution time: 0.137 ms
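
The plans above look like EXPLAIN ANALYZE output. Assuming that is how they were produced, a sketch like the following should show the same difference. Costs, timings, and row counts will vary, and newer PostgreSQL versions may plan the first query differently.

```sql
-- Assumption: the plans shown above came from EXPLAIN ANALYZE.

-- First query: the filter references no column of foo.
EXPLAIN ANALYZE
SELECT num
FROM (
    SELECT unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) AS num
) AS foo
WHERE random() > 0.5;

-- Second query: the filter references foo.num.
EXPLAIN ANALYZE
SELECT num
FROM (
    SELECT unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) AS num
) AS foo
WHERE random() > 0.1 * num;
```

The line to look for is whether the condition appears as "One-Time Filter" (evaluated once for the whole result) or as "Filter" (evaluated per row).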

2 Comments

Technically, I agree with the answers given, and they confirm what I thought was happening. But somehow I am not happy with the consistency, though I find it hard to put into words. My point is, does it make sense from a use-case point of view? (Please tell me if I should rephrase.)
I see your point. Note however that order by expression_not_related_to_results is a special case, it's just a trick out of SQL logic.
2

You're right, this does seem very inconsistent.

The key point here is that random() is VOLATILE, which (in theory) means that the query planner should not be optimising away any calls to this function.
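
As a quick check of that claim, the volatility category of a function can be read from the system catalog. This is only a sketch: pg_proc.provolatile holds 'i' (immutable), 's' (stable), or 'v' (volatile), and random() should be reported as 'v'.

```sql
-- Look up the volatility category recorded for random().
SELECT p.oid::regprocedure AS signature,
       p.provolatile       AS volatility  -- 'i' = immutable, 's' = stable, 'v' = volatile
FROM pg_catalog.pg_proc AS p
WHERE p.proname = 'random';
```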

Interestingly, this only occurs when you invoke the set-returning function with SELECT f(), as opposed to SELECT * FROM f(); this query gives the expected result:

SELECT num 
FROM (
    SELECT * FROM unnest(Array[1,2,3,4,5,6,7,8,9,10]) num
) AS foo 
WHERE random() > 0.5;

I don't know if this is a bug or just a known limitation, as there are similar cases where this kind of behaviour is expected. For example, compare the following:

SELECT random() FROM generate_series(1,10);          -- 10 random numbers
SELECT (SELECT random()) FROM generate_series(1,10); -- 10 copies of the same random number

If you don't get a definitive answer here, you might want to ask the Postgres mailing list if the behaviour you're seeing is intended.

1

Indeed, the Postgres mailing list gave a good response, and it is likely a bug.

This is the answer, including workaround, from Tom Lane:


Hmm, I think this is an optimizer bug. There are two legitimate behaviors here:

SELECT * FROM unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;

should (and does) re-evaluate the WHERE for every row output by unnest().

SELECT unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;

should evaluate WHERE only once, since that happens before expansion of the set-returning function in the targetlist. (If you're an Oracle user and you imagine this query as having an implicit "FROM dual", the WHERE should be evaluated for the single row coming out of the FROM clause.)

In the case you've got here, given the placement of the WHERE in the outer query, you'd certainly expect it to be evaluated for each row coming out of the inner query. But the optimizer is deciding it can push the WHERE clause down to become a WHERE of the sub-select. That is legitimate in a lot of cases, but not when there are SRF(s) in the sub-select's targetlist, because that pushes the WHERE to occur before the SRF(s), analogously to the change between the two queries I wrote.

I'm a bit hesitant to change this in existing releases. Given the lack of previous complaints, it seems more likely to break queries that were behaving as-expected than to make people happy. But we could change it in v10 and up, especially since some other corner-case changes in SRF-in-tlist behavior are afoot.

In the meantime, you could force it to work as you wish by inserting the all-purpose optimization fence "OFFSET 0" in the sub-select:

=# SELECT num FROM (
    SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num OFFSET 0) AS foo WHERE random() > 0.5;
 num
-----
   1
   4
   7
   9
(4 rows)
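
One way to check that the fence has an effect is to inspect the plan of the fenced query. This is a sketch only. Judging from the plans in the first answer, the condition would be expected to appear as a per-row "Filter" above the sub-select rather than as a "One-Time Filter", but the exact plan shape depends on the PostgreSQL version and is an assumption here.

```sql
-- Inspect the plan of the workaround query.
EXPLAIN
SELECT num
FROM (
    SELECT unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) AS num
    OFFSET 0  -- optimization fence: stops the outer WHERE from being pushed into the sub-select
) AS foo
WHERE random() > 0.5;
```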
