
I have a query in PostgreSQL that accesses a big table using a LIKE clause for pattern matching:

                                                    Table "rmx_service_schema.document"
        Column         |            Type             | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
-----------------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 id                    | character varying(36)       |           | not null |         | extended |             |              | 
 file_name             | character varying(512)      |           | not null |         | extended |             |              | 
...

The query has very good selectivity:

select count(*) from RMX_SERVICE_SCHEMA.DOCUMENT d1_0;
 count  
--------
 630015


select count(*) from RMX_SERVICE_SCHEMA.DOCUMENT d1_0 where d1_0.FILE_NAME LIKE 'sunet_attachments/20240207.xml';
 count 
-------
     1

The application sometimes uses % at the end of the pattern, so replacing LIKE with = is not always possible.

I have created an index on that column with the matching operator definition:

CREATE INDEX rse_tmp_doc_file_name ON RMX_SERVICE_SCHEMA.DOCUMENT (file_name varchar_pattern_ops);

But still, the pattern matching query does a Seq Scan:

EXPLAIN ANALYZE select id from RMX_SERVICE_SCHEMA.DOCUMENT d1_0 where d1_0.FILE_NAME LIKE 'sunet_attachments/20240207.xml';

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..129562.16 rows=63 width=37) (actual time=81.075..90.793 rows=1 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Parallel Seq Scan on document d1_0  (cost=0.00..128555.86 rows=16 width=37) (actual time=72.099..77.022 rows=0 loops=5)
         Filter: ((file_name)::text ~~ 'sunet_attachments/20240207.xml'::text)
         Rows Removed by Filter: 126007
 Planning Time: 0.285 ms
 Execution Time: 90.814 ms
(8 rows)

If I replace the LIKE by =, it uses the index:

EXPLAIN ANALYZE select id from RMX_SERVICE_SCHEMA.DOCUMENT d1_0 where d1_0.FILE_NAME ='sunet_attachments/20240207.xml';
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using rse_tmp_doc_file_name on document d1_0  (cost=0.55..8.57 rows=1 width=37) (actual time=0.025..0.026 rows=1 loops=1)
   Index Cond: ((file_name)::text = 'sunet_attachments/20240207.xml'::text)
 Planning Time: 0.053 ms
 Execution Time: 0.034 ms
(4 rows)

Did I miss some steps required to make this btree index usable for pattern matching queries?

Indexes:
    "pk_document" PRIMARY KEY, btree (id)
     ....
    "rse_tmp_doc_file_name" btree (file_name varchar_pattern_ops)

I expected the index I created to be used for pattern matching as well, as long as selectivity is good and the pattern doesn't start with a wildcard.

I tried SET enable_seqscan=off, as suggested. The plan changed, but the query is still very slow:

EXPLAIN ANALYZE select id from RMX_SERVICE_SCHEMA.DOCUMENT d1_0 where d1_0.FILE_NAME LIKE 'sunet_attachments/20240207.xml';
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=31133.85..158837.55 rows=63 width=37) (actual time=300.945..314.717 rows=1 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Parallel Bitmap Heap Scan on document d1_0  (cost=30133.85..157831.25 rows=16 width=37) (actual time=290.728..297.328 rows=0 loops=5)
         Filter: ((file_name)::text ~~ 'sunet_attachments/20240207.xml'::text)
         Rows Removed by Filter: 71233
         Heap Blocks: exact=19555
         ->  Bitmap Index Scan on rse_tmp_doc_file_name  (cost=0.00..30133.83 rows=355328 width=0) (actual time=149.426..149.426 rows=356167 loops=1)
               Index Cond: (((file_name)::text ~>=~ 'sunet'::text) AND ((file_name)::text ~<~ 'suneu'::text))
 Planning Time: 0.176 ms
 Execution Time: 314.747 ms
(11 rows)

But this plan gave me the right hint. The problem is the _ after the string sunet: in a LIKE pattern it is a wildcard that matches any single character, so the index condition can only use the prefix sunet. That prefix is not selective, since about 50% of the file_name values in the table start with sunet. With the _ escaped in the SQL, the index works:

EXPLAIN ANALYZE select id from RMX_SERVICE_SCHEMA.DOCUMENT d1_0 where d1_0.FILE_NAME LIKE 'sunet\_attachments/20240207.xml' escape '\';

                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using rse_tmp_doc_file_name on document d1_0  (cost=0.55..8.57 rows=63 width=37) (actual time=0.014..0.015 rows=1 loops=1)
   Index Cond: (((file_name)::text ~>=~ 'sunet_attachments/20240207_10111647337'::text) AND ((file_name)::text ~<~ 'sunet_attachments/20240207'::text))
   Filter: ((file_name)::text ~~ 'sunet\_attachments/20240207.xml'::text)
 Planning Time: 0.152 ms
 Execution Time: 0.024 ms
(5 rows)
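
To see why the unescaped pattern produced the range 'sunet' to 'suneu' in the bitmap plan above, here is a rough sketch in Python of how a fixed prefix can be derived from a LIKE pattern. This is a simplified model for illustration only, not PostgreSQL's actual planner code; the function names like_fixed_prefix and prefix_range are made up for this example, and it assumes a backslash escape character and plain byte-wise ordering (which is what varchar_pattern_ops gives you).

```python
def like_fixed_prefix(pattern: str, escape: str = "\\") -> str:
    """Return the literal text before the first unescaped LIKE wildcard."""
    prefix = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # An escaped character is taken literally, even if it is % or _.
            prefix.append(pattern[i + 1])
            i += 2
            continue
        if ch in "%_":
            break  # the first wildcard ends the usable prefix
        prefix.append(ch)
        i += 1
    return "".join(prefix)


def prefix_range(prefix: str) -> tuple:
    """Half-open range [lower, upper) that covers every string starting with prefix.

    Simplification: the upper bound is the prefix with its last character
    bumped by one code point.
    """
    if not prefix:
        raise ValueError("an empty prefix cannot be turned into a range")
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper


# Unescaped: the _ is a wildcard, so only 'sunet' is usable for the index.
print(prefix_range(like_fixed_prefix("sunet_attachments/20240207.xml")))

# Escaped: the _ is literal, so the whole string is the prefix.
print(like_fixed_prefix("sunet\\_attachments/20240207.xml"))
```

With the unescaped pattern the range covers every value starting with sunet, which is about half of the table here, so the index condition alone removes very little.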
  • The suggested question is related to pg_trgm indexes; my question is about btree indexes. Commented Sep 10 at 15:04
  • Can you run SET enable_seqscan = off;, then run the EXPLAIN ANALYZE with LIKE again and add the result to the question? Commented Sep 10 at 15:10
  • @Laurenz Albe Thanks a lot, that was the right hint. I have to escape the _ in my search pattern. Now it works. Commented Sep 10 at 15:32
  • Great. Rather than adding the solution to the question, you could write an answer to your own question. I for one would be happy to upvote it. Commented Sep 10 at 15:39

1 Answer

The trailing '%' isn't what is stopping the index from working; it is the '_' in the directory name. In a LIKE pattern, '_' matches any single character, so you are going to have to escape the underscores.

https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-LIKE
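
If the pattern is built in application code, the escaping can be done there before the value is bound as a query parameter. Below is a minimal sketch in Python, assuming backslash as the escape character (the PostgreSQL default for LIKE). The helper names escape_like and prefix_pattern are hypothetical and not part of any library, and the application language used in the question is not stated, so treat this as an illustration of the idea only.

```python
def escape_like(literal: str, escape: str = "\\") -> str:
    """Escape LIKE metacharacters so that literal matches only itself.

    Assumes the escape character is not itself % or _.
    """
    # Escape the escape character first, otherwise the backslashes added
    # for % and _ below would be doubled as well.
    out = literal.replace(escape, escape + escape)
    out = out.replace("%", escape + "%")
    out = out.replace("_", escape + "_")
    return out


def prefix_pattern(literal: str, escape: str = "\\") -> str:
    """Build a pattern matching every value that starts with literal."""
    return escape_like(literal, escape) + "%"


# Exact match through LIKE: the underscore is now literal.
print(escape_like("sunet_attachments/20240207.xml"))

# Prefix search: only the trailing % added here acts as a wildcard.
print(prefix_pattern("sunet_attachments/"))
```

The resulting string would then be passed as a bind parameter to the LIKE predicate, optionally with an explicit escape '\' clause as in the question, so that the escaping does not depend on defaults.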


2 Comments

That underscore wildcard isn't preventing PostgreSQL from using the index. But there may be so many rows matching the prefix that PostgreSQL estimates a sequential scan to be cheaper.
This is true. Unfortunately, about 50% of the values have the prefix sunet. With the _ escaped, it works fine now. Thanks a lot to all!
