- Create 2 identical tables:
CREATE TABLE logs1 (
    user_id     int4        NOT NULL,
    create_time timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE logs2 (
    user_id     int4        NOT NULL,
    create_time timestamptz NOT NULL DEFAULT now()
);
- Insert 200,000,000 rows into the first table, 2,000,000 for each of 100 users (a quick distribution check follows the insert block):
INSERT INTO public.logs1 (user_id) (select 1 from generate_series(1,2000000));
INSERT INTO public.logs1 (user_id) (select 2 from generate_series(1,2000000));
INSERT INTO public.logs1 (user_id) (select 3 from generate_series(1,2000000));
--...
INSERT INTO public.logs1
SELECT user_id
, now()+make_interval(secs => user_id)
FROM generate_series(4,100) AS user_id
CROSS JOIN LATERAL generate_series(1,2e6) AS duplicate_count;
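To verify the uniform distribution (optional, just a sanity check):
SELECT user_id, count(*) -- expect exactly 2,000,000 for every user
FROM logs1
GROUP BY user_id
ORDER BY user_id;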
- Insert 200,000,000 rows into the second table, but skewed, so that a handful of users account for most of the data (again, a check follows the block):
INSERT INTO public.logs2(user_id) select 1 from generate_series(1,95802970);
INSERT INTO public.logs2(user_id) select 2 from generate_series(1,10561980);
INSERT INTO public.logs2(user_id) select 3 from generate_series(1,17014900);
--...
INSERT INTO public.logs2
SELECT ordinality+3
, now()+make_interval(secs => ordinality)
FROM unnest(ARRAY[/*95802970,10561980,17014900,*/14987000,14701010,13137950,7180130,6930550,4260890,3702960,3215300,2648260,1197320,802200,709840,661650,628120,590780,433570,291460,195000,158450,55980,34210,21660,17400,9500,7850,7450,4910,4860,3150,2760,2620,2160,1810,1630,1410,1090,1050,830,610,560,540,500,490,330,240,210,160,130,110,100,100,90,90,90,80,80,70,70,70,60,40,40,40,30,30,30,30,30,20,20,20,20,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10])
WITH ORDINALITY AS _(val,ordinality)
CROSS JOIN LATERAL generate_series(1,val);
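The same kind of check for the skew; the top user alone should hold almost half of the 200,000,000 rows:
SELECT user_id
     , count(*) AS cnt
     , round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM logs2
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 5;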
- Create identical indexes:
CREATE INDEX logs1_user_id_idx ON ONLY logs1 USING btree (user_id);
CREATE INDEX logs1_create_time_idx ON ONLY logs1 USING btree (create_time);
CREATE INDEX logs1_user_id_create_time_idx ON ONLY logs1 USING btree (user_id, create_time);
CREATE INDEX logs2_user_id_idx ON ONLY logs2 USING btree (user_id);
CREATE INDEX logs2_create_time_idx ON ONLY logs2 USING btree (create_time);
CREATE INDEX logs2_user_id_create_time_idx ON ONLY logs2 USING btree (user_id, create_time);
- Run ANALYZE on both tables:
ANALYZE logs1;
ANALYZE logs2;
As a result, I have two tables with identical schemas and indexes and the same number of rows, but with very different data distributions.
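After ANALYZE, the difference is visible in the statistics the planner works from; a quick way to see what it knows about user_id (the top most-common-value frequency should be around 0.01 for logs1 and close to 0.5 for logs2):
SELECT tablename
     , n_distinct
     , most_common_freqs[1] AS top_value_freq
FROM pg_stats
WHERE tablename IN ('logs1', 'logs2')
  AND attname = 'user_id';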
I ran this query on both tables:
EXPLAIN ANALYZE
SELECT *
FROM logs1 -- logs2
WHERE user_id = 1001
ORDER BY create_time DESC
OFFSET 0 ROWS
FETCH FIRST 10 ROWS ONLY
And here is the result:
--------- logs1 ---------
Limit (cost=0.57..8.59 rows=1 width=12) (actual time=0.026..0.028 rows=0 loops=1)
-> Index Only Scan Backward using logs1_user_id_create_time_idx on logs1 (cost=0.57..8.59 rows=1 width=12) (actual time=0.024..0.024 rows=0 loops=1)
Index Cond: (user_id = 1001)
Heap Fetches: 0
Planning Time: 0.447 ms
Execution Time: 0.062 ms
--------- logs2 ---------
Limit (cost=0.57..0.78 rows=10 width=12) (actual time=0.042..0.044 rows=0 loops=1)
-> Index Only Scan Backward using logs2_user_id_create_time_idx on logs2 (cost=0.57..414.57 rows=20000 width=12) (actual time=0.039..0.040 rows=0 loops=1)
Index Cond: (user_id = 1001)
Heap Fetches: 0
Planning Time: 0.220 ms
Execution Time: 0.080 ms
In both cases the right index was used (logs1_user_id_create_time_idx and logs2_user_id_create_time_idx, respectively), so the query executed quickly. Now I run this query:
EXPLAIN ANALYZE
SELECT *
FROM logs1 -- logs2
WHERE user_id IN (1001, 1002)
ORDER BY create_time DESC
OFFSET 0 ROWS
FETCH FIRST 10 ROWS ONLY
And here is the result:
--------- logs1 ---------
Limit (cost=13.18..13.18 rows=1 width=12) (actual time=0.086..0.087 rows=0 loops=1)
-> Sort (cost=13.18..13.18 rows=1 width=12) (actual time=0.083..0.084 rows=0 loops=1)
Sort Key: create_time DESC
Sort Method: quicksort Memory: 25kB
-> Index Only Scan using logs1_user_id_create_time_idx on logs1 (cost=0.57..13.17 rows=1 width=12) (actual time=0.074..0.074 rows=0 loops=1)
Index Cond: (user_id = ANY ('{1001,1002}'::integer[]))
Heap Fetches: 0
Planning Time: 0.223 ms
Execution Time: 0.124 ms
--------- logs2 ---------
Limit (cost=0.57..1303.11 rows=10 width=12) (actual time=254657.997..254658.003 rows=0 loops=1)
-> Index Scan Backward using logs2_create_time_idx on logs2 (cost=0.57..5210160.65 rows=40000 width=12) (actual time=254657.992..254657.993 rows=0 loops=1)
Filter: (user_id = ANY ('{1001,1002}'::integer[]))
Rows Removed by Filter: 200000000
Planning Time: 0.247 ms
Execution Time: 254658.069 ms
In the case of logs1, the query executed quickly because the same index was used as in the previous query.
In the case of logs2, the query took about 2 million times longer (over 4 minutes) because a different index was used: logs2_create_time_idx. If I drop that index, Postgres comes to its senses, uses logs2_user_id_create_time_idx, and the query executes quickly again.
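For completeness, the statement I used (nothing else in my schema needs this index):
DROP INDEX logs2_create_time_idx;
After dropping it, the plan becomes: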
Limit (cost=9.18..9.18 rows=1 width=12) (actual time=0.143..0.146 rows=0 loops=1)
-> Sort (cost=9.18..9.18 rows=1 width=12) (actual time=0.141..0.143 rows=0 loops=1)
Sort Key: create_time DESC
Sort Method: quicksort Memory: 25kB
-> Index Only Scan using logs2_user_id_create_time_idx on logs2 (cost=0.57..9.17 rows=1 width=12) (actual time=0.122..0.123 rows=0 loops=1)
Index Cond: (user_id = ANY ('{1001,1002}'::integer[]))
Heap Fetches: 0
Planning Time: 0.245 ms
Execution Time: 0.188 ms
I am in a situation where Postgres picks the wrong index for me, which makes the query take incomparably longer than it would with the correct index. What can I do about it? In SQL Server I could force the engine to use a specific index, but in Postgres, as far as I know, this cannot be done.
I will also add that Postgres uses the correct index when I operate on a smaller amount of data, e.g. 2,000,000 rows instead of 200,000,000.
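A workaround sketch that avoids the bad plan: fetch each user separately, so every branch can use the (user_id, create_time) index with a backward scan, then merge and re-limit. It works, but it is verbose and has to be generated dynamically for an arbitrary IN list:
SELECT *
FROM (
    (SELECT * FROM logs2 WHERE user_id = 1001 ORDER BY create_time DESC FETCH FIRST 10 ROWS ONLY)
    UNION ALL
    (SELECT * FROM logs2 WHERE user_id = 1002 ORDER BY create_time DESC FETCH FIRST 10 ROWS ONLY)
) AS per_user
ORDER BY create_time DESC
FETCH FIRST 10 ROWS ONLY;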