I know what DISTINCT means and what generate_series does. But when I execute this query, question marks are flying around my head.

select distinct generate_series(0,8)

The result is very weird.

Can somebody please explain what is happening?

  • Why is the result weird? The query means "give me a distinct set of values", and the result contains only distinct values. If you want an ordered set, append an ORDER BY clause. It could be that PostgreSQL builds a hash map in memory to eliminate duplicates and returns the values ordered by hash. Commented Mar 10, 2014 at 18:55
  • @kordirko To someone not used to thinking about the abstractions of query planning, it is surprising that adding a DISTINCT can change the order of results. However, your comment is effectively a short version of my answer below. :) Commented Mar 10, 2014 at 18:58
  • @IMSoP thank you for comprehensive explanation, +1 from me, since I've learned something new from your answer - I didn't know that the explain shows HashAggregate when it is using a hash map. Commented Mar 10, 2014 at 19:16

1 Answer

A SELECT query with no ORDER BY clause has no defined order; it will simply return the relevant rows in whatever order happens to be convenient to the executing DBMS.

In the case of a "real" table, this might be in order of PRIMARY KEY, in the order they were inserted into the table, or in the order of a particular index that was used in the execution plan.

In this example, the "table" created by generate_series() obviously starts off in the order 0, 1, 2, 3, etc. However, to enforce the DISTINCT you put on the query, Postgres has to do something to check whether items appear more than once. (There is no way for it to know that the generate_series() function will always provide distinct values.)

An efficient way of doing this (in general) is to build a "hash map" of the values you want to check for uniqueness. Rather than checking each new value against every existing value, you calculate which "hash bucket" it would fall into; if the bucket is empty, the value is unique; if not, you need only compare it against the other values in that bucket.

Running EXPLAIN select distinct generate_series(0,8) will show you the query plan Postgres has selected; for me (and presumably for you) this looks like this:

HashAggregate  (cost=0.02..0.03 rows=1 width=0)
  ->  Result  (cost=0.00..0.01 rows=1 width=0)

As expected, there's a HashAggregate operation there, running over the result of generate_series() in order to check it for uniqueness. (Exactly how that operation works I don't know, and it isn't important, but the name strongly suggests it's using a hash map to do the work.)

At the end of the hashing operation, Postgres can simply read out the values from the hash map, rather than going back to the original list, so it does so. As a result, they are no longer in the original order, but ordered according to the "hash buckets" they fell into.
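To make the effect concrete, here is a toy sketch in Python. It is not Postgres's actual implementation: the function names, the hash formula, and the bucket count of 4 are all made up for illustration. It only shows how de-duplicating through hash buckets and then reading the buckets out in turn can reorder values that arrived in ascending order.

```python
def toy_hash(value, n_buckets):
    # Hypothetical hash function, chosen only so that neighbouring
    # values land in different buckets. Postgres uses its own hashing.
    return (value * 7 + 3) % n_buckets


def distinct_via_hash_buckets(values, n_buckets=4):
    """Toy model of hash-based de-duplication."""
    buckets = [[] for _ in range(n_buckets)]
    for value in values:
        bucket = buckets[toy_hash(value, n_buckets)]
        # Only values in the same bucket need to be compared.
        if value not in bucket:
            bucket.append(value)
    # Read the result straight out of the buckets, one bucket at a time,
    # instead of going back to the original input order.
    return [value for bucket in buckets for value in bucket]


rows = list(range(0, 9))                 # like generate_series(0,8)
result = distinct_via_hash_buckets(rows)
print(result)                            # [3, 7, 2, 6, 1, 5, 0, 4, 8]
print(sorted(result))                    # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The set of values is unchanged; only the order differs, and sorting explicitly (the equivalent of ORDER BY) is what brings the ascending order back.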

The moral of the story is: Always use an ORDER BY clause!

6 Comments

Thanks for your response! However, when I execute this query, it is ordered: select * from generate_series(0,8). So why should DISTINCT randomize its output?
OK... because of HashAggregate. I think I should learn more about that :)
@user2513954 You don't really need to know about HashAggregate, just know that if you don't specify an ORDER BY, there is absolutely no guarantee of a particular order. The DBMS will do whatever it can to process the rest of the query, and whatever order the data happens to be in when it's done, that is the order you will see. It just happens that when running select * from generate_series(0,8) the data ends up in a neat ascending order.
Don't you think that is a bit weird?
@user2513954 Not really: you didn't ask for an order, so the DBMS didn't define one. The order 0,1,2,3 etc seems "natural" to you, but Postgres doesn't think like a human. Importantly, extending this to more "real life" queries, most queries will not have a neat "natural" order like generate_series() does. They'll be retrieving from some index; or a fragmented page of data where rows have been updated, deleted, and re-inserted; or a complex combination of multiple tables and JOIN operations.