13

I want to index data in height dimensions (128 dimensional vectors of integers in range of [0,254] are possible):

| id |      vector       |
|  1 | { 1, 0, ..., 254} |
|  2 | { 2, 128, ...,1}  |
|  . | { 1, 0, ..., 252} |
|  n | { 1, 2, ..., 251} |

I saw that PostGIS implemented R-Trees. So can I use these trees in PostGIS to index and query multidimensional vectors in Postgres?

I also saw that there is a index implementation for int arrays.

Now I have questions about how to perform a query.
Can I perform a knn-search and a radius search on an integer array? Maybe I also must define my own distance function. Is this possible? I want to use the Manhattan distance (block distance) for my queries.

I also can represent my vector as a binary string with the pattern v1;v2;...;vn. Does this help to perform the search?

For example if I had these two string:

1;2;1;1
1;3;2;2

The result / distance between these two strings should be 3.

2
  • With your example it isn't exactly clear - do you want to see how many elements differ (levenshtein distance) or how much do they differ (Manhattan distance), eg (3-2) + (2-1) + (2-1). Commented Feb 15, 2016 at 14:34
  • @hruske: I want to know how much do they differ (manhattan, taxlicab, block)- distance Commented Feb 15, 2016 at 14:38

3 Answers 3

20

Perhaps a better choice would be the cube extension, since your area of interest is not individual integer, but full vector.

Cube supports GiST indexing, and Postgres 9.6 will also bring KNN indexing to cubes, supporting euclidean, taxicab (aka Manhattan) and chebishev distances.

It is a bit annoying that 9.6 is still in development, however there's no problem backporting patch for cube extension to 9.5 and I say that from experience.

Hopefully 128 dimensions will still be enough to get meaningful results.

How to do this?

First have an example table:

create extension cube;
create table vectors (id serial, vector cube);

Populate table with example data:

insert into vectors select id, cube(ARRAY[round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000)]) from generate_series(1, 2000000) id;

Then try selecting:

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=123352.07..123352.09 rows=10 width=76) (actual time=1705.499..1705.501 rows=10 loops=1)
   ->  Sort  (cost=123352.07..129852.07 rows=2600000 width=76) (actual time=1705.496..1705.497 rows=10 loops=1)
         Sort Key: (('(966, 82, 765, 343, 600, 718, 338, 505)'::cube <#> vector))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Seq Scan on vectors  (cost=0.00..67167.00 rows=2600000 width=76) (actual time=0.038..998.864 rows=2600000 loops=1)
 Planning time: 0.172 ms
 Execution time: 1705.541 ms
(7 rows)

We should create an index:

create index vectors_vector_idx on vectors (vector);

Does it help:

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;

--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..1.93 rows=10 width=76) (actual time=41.339..143.915 rows=10 loops=1)
   ->  Index Scan using vectors_vector_idx on vectors  (cost=0.41..393704.41 rows=2600000 width=76) (actual time=41.336..143.902 rows=10 loops=1)
         Order By: (vector <#> '(966, 82, 765, 343, 600, 718, 338, 505)'::cube)
 Planning time: 0.146 ms
 Execution time: 145.474 ms
(5 rows)

At 8 dimensions, it does help.

Sign up to request clarification or add additional context in comments.

2 Comments

I also can represent the vector as a binary string(v1;v2;v3...;vn).Is there a option to compare binary strings with manhattan distance?
Taxicab distance is also known as Manhattan distance. en.wikipedia.org/wiki/Taxicab_geometry
12

(Addendum to selected answer)

For people wanting more than 100 dimensions, beware: there's a 100 dimensions limit in cube extension.

The tricky part is that postgres allows you to create cubes with more than 100 dimensions just fine. It's when you try to restore a backup that it is refused (the worst time to realize that).

As recommended in documentation, I patched cube extension to support more dimensions. I made a docker image for it, and you can look at the Dockerfile to see how to do it yourself, from the github repos.

Comments

0

Cross-posting my answer from this similar DBA Stack Exchange question:


The 2023 answer to this is to use the pgvector extension.

As of this writing, the number of maximum dimensions (defined here) is 16k.

5 Comments

It's unfortunate that pgvector is not supported in AWS RDS hosted Postgres. Realistically my current project is not going to move off RDS. cube is on RDS though.
@Kobold for sure -- pgvector recommends (here) to contact AWS via this page about adding extensions. I feel like there's a good chance of pgvector making it in considering the excitement around embeddings right now.
What's the difference between pgvector and cube?
cube is limited in the number of dimensions it supports. From the postgres docs: "To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes". Depending on the use case, it may be enough or not (AI embeddings typically require a few thousand dimensions)
pgVector is indeed supported now by RDS, AWS posted the announcement recently. See aws.amazon.com/about-aws/whats-new/2023/05/…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.