
Good Day,

I would like to know the best way to partition a Postgres table on a column prefix. I have a large table (roughly 300 million rows x 10 columns) and I would like to partition it on a prefix of column 1. The data looks like:

ABCDEF1xxxxxxxx
ABCDEF1xxxxxxxy
ABCDEF1xxxxxxxz
ABCDEF2xxxxxxxx
ABCDEF2xxxxxxxy
ABCDEF2xxxxxxxz
ABCDEF3xxxxxxxx
ABCDEF3xxxxxxxz
ABCDEF4xxxxxxxx
ABCDEF4xxxxxxxy

There will only ever be 10 partitions, i.e. ABCDEF0...->ABCDEF9...

What I've currently done is make tables like:

CREATE TABLE public.mydata_ABCDEF1 (
CHECK ( col1 like 'ABCDEF1%' )
) INHERITS (public.mydata);

CREATE TABLE public.mydata_ABCDEF2 (
CHECK ( col1 like 'ABCDEF2%' )
) INHERITS (public.mydata);

etc. Then the trigger with similar logic:

IF ( NEW.col1 like 'ABCDEF1%' ) THEN 
    INSERT INTO public.mydata_ABCDEF1 VALUES (NEW.*);
ELSIF ( NEW.col1 like 'ABCDEF2%' ) THEN
    INSERT INTO public.mydata_ABCDEF2 VALUES (NEW.*);
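For reference, spelled out as a complete trigger the routing logic would be something like the sketch below (the remaining branches and the column list are elided; the function and trigger names are made up for illustration):

```sql
CREATE OR REPLACE FUNCTION public.mydata_insert_trigger()
RETURNS trigger AS $$
BEGIN
    IF NEW.col1 LIKE 'ABCDEF1%' THEN
        INSERT INTO public.mydata_ABCDEF1 VALUES (NEW.*);
    ELSIF NEW.col1 LIKE 'ABCDEF2%' THEN
        INSERT INTO public.mydata_ABCDEF2 VALUES (NEW.*);
    -- ... one branch per remaining prefix (ABCDEF0, ABCDEF3 .. ABCDEF9) ...
    ELSE
        RAISE EXCEPTION 'col1 has an unexpected prefix: %', NEW.col1;
    END IF;
    RETURN NULL;  -- the row was routed to a child table; skip the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_mydata_trigger
    BEFORE INSERT ON public.mydata
    FOR EACH ROW EXECUTE PROCEDURE public.mydata_insert_trigger();
```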

I'm wondering whether partitioning in this way will actually speed up query time, whether I should instead partition on a substr() of the column (not sure how), or whether I should add a new column containing just the prefix and partition on that.

Any advice is appreciated.

3 Answers


I know this is an old question, but I am adding this answer in case anyone else needs a solution.

Postgres 10 introduced declarative range partitioning: https://www.postgresql.org/docs/10/static/ddl-partitioning.html.

While the examples in the docs use date ranges, you can also use string ranges, since strings compare according to the column's collation (plain byte order under the C collation). The code below creates a parent table and then two child tables which, depending on your specific codes, should automatically bin rows by the prefixes provided. Range bounds are half-open: the FROM value is inclusive and the TO value is exclusive, so consecutive ranges such as 'ABCDEF1' to 'ABCDEF2' do not overlap, and each one catches every key with that prefix.

CREATE TABLE mydata (...) PARTITION BY RANGE (col1);
CREATE TABLE mydata_abcdef1 PARTITION OF mydata 
  FOR VALUES FROM ('ABCDEF1') TO ('ABCDEF2');
CREATE TABLE mydata_abcdef2 PARTITION OF mydata 
  FOR VALUES FROM ('ABCDEF2') TO ('ABCDEF3');
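Since the question also asked about partitioning on a substring: Postgres 10 accepts an expression as the partition key, so list partitioning on the prefix itself is another option. A sketch (the 7-character prefix length is my assumption from the sample data):

```sql
CREATE TABLE mydata (...) PARTITION BY LIST (left(col1, 7));

CREATE TABLE mydata_abcdef1 PARTITION OF mydata FOR VALUES IN ('ABCDEF1');
CREATE TABLE mydata_abcdef2 PARTITION OF mydata FOR VALUES IN ('ABCDEF2');
-- ... and so on for the remaining prefixes ...
```

Note that with an expression key the planner can only prune partitions when a query filters on the same expression, e.g. WHERE left(col1, 7) = 'ABCDEF1'.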



It will significantly speed up your queries when each of the partitioned tables has its indexes defined appropriately, e.g.:

CREATE INDEX ON public.mydata_ABCDEF1 (...) WHERE col1 like 'ABCDEF1%';
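One caveat worth adding here: with inheritance partitioning, the planner's constraint exclusion proves which children to skip using btree comparison operators, and it generally cannot reason about LIKE patterns, so range-style CHECK constraints tend to prune better. A sketch under that assumption:

```sql
-- For fixed-width ASCII prefixes under the C collation, this range CHECK
-- matches the same rows as LIKE 'ABCDEF1%', but the planner can use it
-- to exclude child tables from a query plan.
CREATE TABLE public.mydata_ABCDEF1 (
    CHECK (col1 >= 'ABCDEF1' AND col1 < 'ABCDEF2')
) INHERITS (public.mydata);

SET constraint_exclusion = partition;  -- the default setting

-- Only the (empty) parent and mydata_ABCDEF1 should appear in the plan:
EXPLAIN SELECT * FROM public.mydata
WHERE col1 >= 'ABCDEF1' AND col1 < 'ABCDEF2';
```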

1 Comment

Yes, it is my intention to index the "partition" tables once the data is populated. My question is more about whether partitioning this "character" field using LIKE is the best method.

The short answer is "probably not," but it really depends on exactly what your queries are.

The real question is: what are you trying to accomplish with the partitioning? Generally speaking, PostgreSQL's btree indexes are very fast and efficient at finding the specific records you ask for, faster than PostgreSQL is at figuring out which table in a set of partitioned tables your data is stored in.

Where partitioning is extremely useful is in data management. The reason is that you can often partition by time and then, once the data has aged out, simply drop the old partition instead of issuing "DELETE" queries, which only mark records as deleted, have to be VACUUMed to reclaim the space, and end up causing bloat in the table and indexes.
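With the inheritance scheme in this question, that age-out step is just DDL. A sketch with hypothetical time-based child names and a hypothetical created_at column:

```sql
-- Dropping an aged-out child is near-instant and frees the space immediately:
ALTER TABLE public.mydata_2015_01 NO INHERIT public.mydata;
DROP TABLE public.mydata_2015_01;

-- ...whereas a DELETE only marks rows dead, leaving VACUUM to reclaim space:
DELETE FROM public.mydata WHERE created_at < DATE '2015-02-01';
```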

300M records is about the point where I might consider partitioning, but I wouldn't jump to partitioning the data at that point without a clear reason why having the data partitioned will be helpful.

Also, be aware that PostgreSQL's query planner does not handle very large numbers of partitions well; hundreds or thousands of partitions will slow down planning time. That is not easy to see in pre-9.5 versions, but as of 9.5, "EXPLAIN ANALYZE" reports the planning time for a given query:

=*> explain analyze select * from downloads;
                                                      QUERY PLAN                                       
-------------------------------------------------------------------------------------------------------
 Seq Scan on downloads  (cost=0.00..38591.76 rows=999976 width=193) (actual time=23.863..2088.732 rows=
 Planning time: 0.219 ms
 Execution time: 2552.878 ms
(3 rows)

1 Comment

Firstly, a correction: the total count of the data is 750 million rows. Essentially it is an audit history of equipment, with column 1 mentioned in my post being the equipment ID. ABCDEF represents our company and is always part of the ID. The 0-9 digit represents the "bin" (thus a maximum of 10 partitions only), followed by the equipment's actual ID. Partitioning is not intended for data management, as all info is kept "forever"; in my case it is purely for performance. Queries will be on equipment ID: selecting one, or grouping on a bin and counting, etc.
