How to speed up query with DISTINCT in PostgreSQL?

Question

I have a pretty simple SQL statement:

SELECT DISTINCT("CITY" || ' | '  || "AREA" || ' | ' || "REGION") AS LOCATION
FROM youtube

The youtube table used in the query has ~25 million records. The query takes a long time to complete (~25 seconds), and I'm trying to speed it up.

I created the index shown below, but the query above still takes the same time to complete. What did I do wrong? Also, would "partitioning" be a better choice in my case?

CREATE INDEX location_index ON youtube ("CITY", "AREA", "REGION")

EXPLAIN returns:

Unique (cost=5984116.71..6111107.27 rows=96410 width=32)
-> Sort (cost=5984116.71..6047611.99 rows=25398112 width=32)
   Sort Key: ((((("CITY" || ' | '::text) || "AREA") || ' | '::text) || "REGION"))
   -> Seq Scan on youtube (cost=0.00..1037365.24 rows=25398112 width=32)

@george-joseph Query plan of your script: (image not preserved)

Can you try this query: select concat(city, '|', area, '|', region) as location from (select city, area, region, count(*) from youtube group by city, area, region) x;? How long does that take?
– zedfoxus, Commented Dec 6, 2018 at 3:46
Great. That cut the time in half. If you want it faster than that, you may want to consider a materialized view (postgresqltutorial.com/postgresql-materialized-views) and refresh it routinely through a scheduled task/cron job.
– zedfoxus, Commented Dec 6, 2018 at 4:06
As I said before, the youtube table has ~25 million records, and new data is loaded into it every 5 minutes. Would it be better to create an index and partition the table? That was my main question. What do you think?
– Nurzhan Nogerbek, Commented Dec 6, 2018 at 4:32
You can try that. It's hard to say with certainty whether partitioning will solve your problem. If you run this kind of query infrequently and the user can wait 10 seconds, there is no need to change anything. If the user cannot wait that long, cache the result every hour in a materialized view. Your query doesn't have a WHERE clause, so I don't think partitioning will help you.
– zedfoxus, Commented Dec 6, 2018 at 4:36
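
The materialized-view suggestion from the comments could look roughly like the sketch below. The view name youtube_locations and the index name youtube_locations_uq are hypothetical; the table and column names are taken from the question. Treat it as a starting point under those assumptions, not a tested setup for this table.

```sql
-- Hypothetical view name: caches the distinct locations once.
CREATE MATERIALIZED VIEW youtube_locations AS
SELECT DISTINCT "CITY", "AREA", "REGION"
FROM youtube;

-- REFRESH ... CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX youtube_locations_uq
    ON youtube_locations ("CITY", "AREA", "REGION");

-- Run this from a scheduled job (for example cron) to pick up new rows.
REFRESH MATERIALIZED VIEW CONCURRENTLY youtube_locations;

-- Readers query the small cached result instead of the 25-million-row table.
SELECT "CITY" || ' | ' || "AREA" || ' | ' || "REGION" AS location
FROM youtube_locations;
```

A plain REFRESH MATERIALIZED VIEW works without the unique index, but it blocks readers of the view while it runs. Either way the cached result is only as fresh as the last refresh, which matters when new data arrives every 5 minutes.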

Laurenz Albe · Accepted Answer · 2018-12-06 07:07:19Z

5

Neither an index nor partitioning can help you here.

Since city, area and region are (probably) closely correlated, the number of result rows will be much smaller than PostgreSQL estimates, because it assumes that the columns are independent of each other.

So you should create extended statistics on those columns, a new feature introduced in PostgreSQL v10:

CREATE STATISTICS youtube_stats (ndistinct)
   ON "CITY", "AREA", "REGION" FROM youtube;

ANALYZE youtube;

Now PostgreSQL has a better idea of how many different groups there are.

Then give the query a lot of memory so that it can get a hash with all these groups into memory. Then it can use a hash aggregate rather than sorting the rows:

SET work_mem = '1GB';

You may not need all that much memory; experiment to find a more reasonable limit.
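
One way to experiment with the memory limit and to check whether the planner switches to a hash aggregate is sketched below. The 256MB value is an arbitrary starting point chosen for illustration, not a recommendation from the answer.

```sql
BEGIN;

-- SET LOCAL limits the change to this transaction only.
-- 256MB is an arbitrary value to start experimenting with.
SET LOCAL work_mem = '256MB';

-- EXPLAIN ANALYZE actually executes the query.
-- Look for a HashAggregate node instead of Sort + Unique, and compare
-- the estimated rows= with the actual row count of that node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT "CITY", "AREA", "REGION"
FROM youtube;

ROLLBACK;
```

Repeating this with smaller work_mem values shows where the plan falls back to sorting.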

Then try the query from George Joseph's answer:

SELECT x."CITY" || ' | '  || x."AREA" || ' | ' || x."REGION" AS location
FROM (SELECT DISTINCT "CITY", "AREA", "REGION"
      FROM youtube) AS x;

edited Dec 6, 2018 at 7:07

answered Dec 6, 2018 at 5:15


4 Comments

Laurenz Albe Over a year ago

@dwir182 To hear is to obey.

joop Over a year ago

"Since city, area and region are (probably) closely correlated": they are also probably low-cardinality, fat, wide columns. In other words, the OP should normalize his data first (or use a spreadsheet).

Nurzhan Nogerbek Over a year ago

Thanks for the direction! =) After your answer, my query became faster (~5 sec). I think I need to learn more about STATISTICS, and I think I need to tune my database. Possible candidates are max_worker_processes and max_parallel_workers_per_gather. What do you think?

Laurenz Albe Over a year ago

Parallelization will only help with the sequential scan, but that didn't change with my query (and you said that took 5 secs). The rest of the time is in sorting, and that can't profit from parallelization. In general, doing the right thing slow is usually faster in databases than doing the wrong thing fast.

George Joseph (edited by Renz Dominique) · Accepted Answer · 2018-12-06 03:57:25Z

1

Since you have an index on the columns, what does the query plan look like if you run the following?

-- Column names are quoted because the table defines them in upper case.
SELECT x."CITY" || ' | '  || x."AREA" || ' | ' || x."REGION"
FROM (SELECT DISTINCT "CITY", "AREA", "REGION"
      FROM youtube) x

edited Dec 6, 2018 at 3:57

Renz Dominique

answered Dec 6, 2018 at 3:48

George Joseph

4 Comments

Nurzhan Nogerbek Over a year ago

You can see the query plan of my query in the post. Please check it again. I ran your code several times. Your query takes ~30 seconds to complete, which is longer than mine. Do you have any ideas?

George Joseph Over a year ago

Can you share the query plan for the query I suggested?

Nurzhan Nogerbek Over a year ago

Please check my post again. I added the query plan of your script.

George Joseph Over a year ago

Thanks for that. Have you already run ANALYZE on the table to update its statistics and see if there is a difference? Also, are the fields city, area and region nullable?
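
Nullability matters for the concatenated output because the || operator returns NULL when any operand is NULL. The short sketch below uses made-up literal values, not data from the youtube table, and shows two common ways to handle that.

```sql
-- || yields NULL as soon as one operand is NULL.
SELECT 'Almaty' || ' | ' || NULL::text || ' | ' || 'South' AS location;
-- NULL

-- concat_ws skips NULL arguments entirely.
SELECT concat_ws(' | ', 'Almaty', NULL, 'South') AS location;
-- 'Almaty | South'

-- coalesce keeps the position of the missing value visible.
SELECT 'Almaty' || ' | ' || coalesce(NULL::text, '') || ' | ' || 'South' AS location;
-- 'Almaty |  | South'
```

If any of the three columns can be NULL, the original query collapses all such rows into a single NULL location, so the rewritten queries are only equivalent when the columns are NOT NULL or the NULLs are handled explicitly.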

Gordon Linoff · Accepted Answer · 2018-12-06 12:36:10Z

0

Indexes should be able to help. Try writing the query as:

SELECT DISTINCT ON ("CITY", "AREA", "REGION") "CITY" || ' | '  || "AREA" || ' | ' || "REGION" AS LOCATION
FROM youtube
ORDER BY "CITY", "AREA", "REGION";

This can take advantage of an index on (city, area, region).

answered Dec 6, 2018 at 12:36
