1

I have a fairly simple SQL statement:

SELECT DISTINCT("CITY" || ' | '  || "AREA" || ' | ' || "REGION") AS LOCATION
FROM youtube

The youtube table used in the query has about 25 million records. The query takes a long time to complete (~25 seconds), and I'm trying to speed it up.

I created the index shown below, but the query above still takes the same time to complete. What did I do wrong? Also, would partitioning help in my case?

CREATE INDEX location_index ON youtube ("CITY", "AREA", "REGION")

EXPLAIN returns:

Unique (cost=5984116.71..6111107.27 rows=96410 width=32)
-> Sort (cost=5984116.71..6047611.99 rows=25398112 width=32)
   Sort Key: ((((("CITY" || ' | '::text) || "AREA") || ' | '::text) || "REGION"))
   -> Seq Scan on youtube (cost=0.00..1037365.24 rows=25398112 width=32) 

@george-joseph QUERY PLAN of your script:

(screenshot of the query plan; image not included)

5
  • Can you try this query: select concat(city, '|', area, '|', region) as location from (select city, area, region, count(*) from youtube group by city, area, region) x; How long does that take? Commented Dec 6, 2018 at 3:46
  • @zedfoxus your query takes ~ 10-12 seconds to complete. Commented Dec 6, 2018 at 3:57
  • Great. That cut the time in half. If you want it faster than that, you may want to consider a materialized view (postgresqltutorial.com/postgresql-materialized-views) and refresh it routinely through a scheduled task/cron job. Commented Dec 6, 2018 at 4:06
  • As I said, the youtube table has ~25 million records, and new data is loaded into it every 5 minutes. Would it be better to create an index and partition the table? That was my main question. What do you think? Commented Dec 6, 2018 at 4:32
  • You can try that. It's hard to say with certainty if partitioning will solve your problem. If you are running this kind of query infrequently and user can wait 10 seconds, no need to make any changes. If your user cannot wait that long, cache the result every hour into a materialized view. Your query doesn't have a where clause so I don't think partitioning will help you. Commented Dec 6, 2018 at 4:36
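
The materialized-view suggestion in the comments above could look roughly like the sketch below. The names youtube_locations and youtube_locations_uq are hypothetical and not from the thread, and the refresh schedule (cron or similar) is left to the reader. Note that each refresh still runs the full DISTINCT query; the cost is only moved off the user's request.

```sql
-- Hypothetical view name; stores only the distinct combinations.
CREATE MATERIALIZED VIEW youtube_locations AS
SELECT DISTINCT "CITY", "AREA", "REGION"
FROM youtube;

-- REFRESH ... CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX youtube_locations_uq
    ON youtube_locations ("CITY", "AREA", "REGION");

-- Run this from a scheduled job; readers are not blocked while it runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY youtube_locations;

-- The application then reads the small precomputed result.
SELECT "CITY" || ' | ' || "AREA" || ' | ' || "REGION" AS location
FROM youtube_locations;
```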

3 Answers

5

Neither an index nor partitioning can help you here.

Since city, area and region are (probably) closely correlated, the number of result rows will be much less than PostgreSQL estimates, because it assumes columns to be independent from each other.

So you should create extended statistics on those columns, a new feature introduced in PostgreSQL v10:

CREATE STATISTICS youtube_stats (ndistinct)
   ON "CITY", "AREA", "REGION" FROM youtube;

ANALYZE youtube;

Now PostgreSQL has a better idea of how many different groups there are.
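
One way to check that the statistics were actually collected is sketched below. It assumes PostgreSQL 12 or later, where the pg_stats_ext view exists; on v10 and v11 the data lives in the pg_statistic_ext catalog instead.

```sql
-- Assumes PostgreSQL 12+ (pg_stats_ext view).
-- n_distinct lists the estimated number of distinct values per column group.
SELECT statistics_name, attnames, n_distinct
FROM pg_stats_ext
WHERE tablename = 'youtube'
  AND statistics_name = 'youtube_stats';

-- The row estimate on the top plan node should now be closer to reality.
EXPLAIN
SELECT DISTINCT "CITY", "AREA", "REGION"
FROM youtube;
```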

Then give the query enough memory to hold a hash table with all these groups. That lets PostgreSQL use a hash aggregate rather than sorting all the rows:

SET work_mem = '1GB';

You may not need all that much memory; experiment to find a more reasonable limit.
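
If you would rather not raise work_mem for the whole session while experimenting, a sketch like the following limits the change to a single transaction. The value 256MB is only an illustrative starting point, not a recommendation.

```sql
SHOW work_mem;                 -- value currently in effect for this session

BEGIN;
SET LOCAL work_mem = '256MB';  -- illustrative value; applies only inside this transaction
-- run the DISTINCT query here and note the timing
COMMIT;                        -- work_mem reverts to its previous value
```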

Then try the query from George Joseph's answer:

SELECT x."CITY" || ' | '  || x."AREA" || ' | ' || x."REGION" AS location
FROM (SELECT DISTINCT "CITY", "AREA", "REGION"
      FROM youtube) AS x;
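
To see whether the planner really switched strategies, you can look at the executed plan. This is a sketch; keep in mind that EXPLAIN with the ANALYZE option actually runs the query.

```sql
-- Look for a HashAggregate node in place of the Sort + Unique pair
-- shown in the plan from the question.
EXPLAIN (ANALYZE, BUFFERS)
SELECT x."CITY" || ' | ' || x."AREA" || ' | ' || x."REGION" AS location
FROM (SELECT DISTINCT "CITY", "AREA", "REGION"
      FROM youtube) AS x;
```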

4 Comments

@dwir182 To hear is to obey.
"Since city, area and region are (probably) closely correlated": they are also probably low-cardinality, fat, wide columns. In other words, the OP should normalize the data first (or use a spreadsheet).
Thanks for the direction! =) After your answer, my query became faster (~5 sec). I think I need to learn more about extended statistics, and I think I need to upgrade my database. Possible settings to look at are max_worker_processes and max_parallel_workers_per_gather. What do you think?
Parallelization will only help with the sequential scan, but that didn't change with my query (and you said that took 5 secs). The rest of the time is in sorting, and that can't profit from parallelization. In general, doing the right thing slow is usually faster in databases than doing the wrong thing fast.
1

Since you have an index on the columns, what does the query plan look like if you run the following?

SELECT x."CITY" || ' | '  || x."AREA" || ' | ' || x."REGION"
FROM (SELECT DISTINCT "CITY", "AREA", "REGION"
      FROM youtube) x 

4 Comments

You can see the query plan of my query in the post; please check it again. I ran your code several times. Your query takes ~30 seconds to complete, which is longer than mine. Do you have any ideas?
Can you share the query plan for the query I suggested?
Please check my post again. I added the query plan of your script.
Thanks for that. Have you already run ANALYZE on the table to update its statistics and see if there is a difference? Also, are the fields city, area and region nullable?
0

Indexes should be able to help. Try writing the query as:

SELECT DISTINCT ON ("CITY", "AREA", "REGION")
       "CITY" || ' | '  || "AREA" || ' | ' || "REGION" AS LOCATION
FROM youtube
ORDER BY "CITY", "AREA", "REGION";

This can take advantage of an index on ("CITY", "AREA", "REGION").
