
I have a products table with an array column tags that contains up to 700 elements.

I want to efficiently search a given user's products for specific tag names, to power a search bar.

Here's my query:

SELECT tags
FROM (
    SELECT DISTINCT unnest(tags) AS tags
    FROM products
    WHERE user_id = :id AND tags IS NOT NULL AND cardinality(tags) > 0
) AS unnested_tags
WHERE tags LIKE '%:search%';

I initially created this index to improve performance:

CREATE INDEX CONCURRENTLY idx_products_user_id_tags
ON products(user_id, tags);

However, I encountered the following error:

ERROR: index row size 3240 exceeds btree version 4 maximum 2704
DETAIL: Index row references tuple (50426,2) in relation "products".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

This is most likely due to the size of the tags arrays.

To address this, I attempted to create the following partial index, but it didn't work:

CREATE INDEX CONCURRENTLY idx_products_user_id_tags_partial
ON products(user_id, tags)
WHERE array_length(tags, 1) <= 100;

Which Indexing Approach is Best for My Use Case?

Key Points

  • Moving tags to a separate table is not feasible at this time.
  • Partial matches are needed, but exact matches can be considered if they significantly improve query performance.
  • The tags array holds anywhere from 1 to over 700 strings per row.
  • Any tag value can be used, with a minimum length of 3 characters.

I would appreciate any insights or suggestions!

  • You might want to try a GIN or GiST index on the array, but the array will still be a challenge. Better to normalize your data; that's what works best in relational databases like PostgreSQL. postgresql.org/docs/current/indexes-types.html Commented Aug 26, 2024 at 18:01
  • Move the tags to a join table. Commented Aug 26, 2024 at 18:01
  • You have two challenges here: first the array, and second the LIKE in the WHERE condition. A GiST or GIN index could solve the array problem, but not the LIKE problem. For that you need the pg_trgm extension, but that doesn't work on arrays. You could replace your current table with a view backed by a normalized data model (see the sketch after these comments). Your current code that expects an array still gets an array, but under the hood you have a proper 3NF data model. That can be indexed and delivers the performance you need. Commented Aug 26, 2024 at 18:19
  • If you can afford the storage space, add a table with one tag per row, plus a trigger on your main table (keep it) to sync them. That way everyone can still use the array, and whoever needs any sort of ~ can use its normalised version. If individual tags may be long, consider full text search, making that two added tables: one with tsvector_ops for whole multi-word tags, another with gin_trgm_ops on distinct words in them, to check for misspellings. Commented Aug 27, 2024 at 9:45
  • First things first: WHERE tags LIKE '%:search%' (if it were legal syntax) also returns partial matches. Is that intended, or do you really filter for exact tag names? Postgres version (SELECT version();)? Exact table definition (CREATE TABLE statement)? How many rows in your table? Table size (SELECT pg_relation_size('products');)? Predominant read / write patterns? Min & max length of your tag names? Can any tag name be used, or is there a set of allowed tags? Commented Aug 27, 2024 at 15:19
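To make the normalized companion table from the comments concrete, here is a minimal sketch. The product_tags table, the products_sync_tags trigger function, and the assumption that products has an id primary key are all illustrative and not part of the original question; it also assumes the pg_trgm extension is available and Postgres 11+ (use EXECUTE PROCEDURE on older versions).

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- One row per (product, tag); kept in sync with products.tags by the trigger below.
CREATE TABLE product_tags (
    product_id bigint NOT NULL REFERENCES products (id) ON DELETE CASCADE,
    user_id    bigint NOT NULL,
    tag        text   NOT NULL,
    PRIMARY KEY (product_id, tag)
);

-- Trigram index so LIKE '%term%' on single tags is indexable; btree index for the user filter.
CREATE INDEX idx_product_tags_tag_trgm ON product_tags USING gin (tag gin_trgm_ops);
CREATE INDEX idx_product_tags_user_id ON product_tags (user_id);

CREATE OR REPLACE FUNCTION products_sync_tags() RETURNS trigger AS $$
BEGIN
    -- Rebuild the tag rows for this product from the current array value.
    DELETE FROM product_tags WHERE product_id = NEW.id;
    INSERT INTO product_tags (product_id, user_id, tag)
    SELECT DISTINCT NEW.id, NEW.user_id, t
    FROM unnest(COALESCE(NEW.tags, '{}')) AS t;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_products_sync_tags
AFTER INSERT OR UPDATE OF tags, user_id ON products
FOR EACH ROW EXECUTE FUNCTION products_sync_tags();

-- The search bar query then becomes a plain indexable lookup:
SELECT DISTINCT tag
FROM product_tags
WHERE user_id = :id AND tag LIKE '%' || :search || '%';

Note that gin_trgm_ops only starts to help once the search term is at least 3 characters long, which matches the constraint listed in the question.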

1 Answer


The output of the UNNEST construct is unindexable, except via a materialized view, in which case it is no longer the output of UNNEST.

You could test which rows might contain a match before unnesting, and then unnest and test again the individual tags of each array that passed. The best way to do this is probably with the parray_gin extension.

SELECT distinct tags
FROM (
    SELECT unnest(tags) AS tags
    FROM products
    WHERE user_id = :id and tags @@> array['%:search%']
) AS unnested_tags
WHERE tags LIKE '%:search%';

Which would be supported by an index such as:

CREATE INDEX ON products USING gin (user_id, tags parray_gin_ops);
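As far as I know, both extensions involved have to be installed first: parray_gin is a third-party extension that is built and installed separately, and a GIN index that also covers a plain scalar column such as user_id needs btree_gin for its operator class. Something like:

CREATE EXTENSION IF NOT EXISTS btree_gin;   -- GIN operator classes for scalar types such as user_id
CREATE EXTENSION IF NOT EXISTS parray_gin;  -- provides parray_gin_ops and the @@> partial-match operator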

Alternatively, it could use a parray_gin index like the above with only the tags column, and a separate btree index on user_id, automatically combined with a BitmapAnd (which is probably what I would prefer: multicolumn GIN indexes have few benefits over multiple single-column ones, and much worse observability).
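A sketch of that two-index alternative (the index names are made up):

-- The planner can combine these with a BitmapAnd when both conditions are selective.
CREATE INDEX CONCURRENTLY idx_products_user_id ON products (user_id);
CREATE INDEX CONCURRENTLY idx_products_tags_parray ON products USING gin (tags parray_gin_ops);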

The performance of this is going to depend on how selective user_id is, how selective the tag pattern-match is, and how many trigrams are contained in the search pattern, among other things. If it uses the index but is still too slow, you would need to share an EXPLAIN (ANALYZE, BUFFERS) for further analysis.
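For example, with real values substituted for the placeholders (123 and foo here are just stand-ins):

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT tags
FROM (
    SELECT unnest(tags) AS tags
    FROM products
    WHERE user_id = 123 AND tags @@> array['%foo%']
) AS unnested_tags
WHERE tags LIKE '%foo%';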
