We are using the Postgres jsonb type in one of our database tables. The table structure is shown below:
CREATE TABLE IF NOT EXISTS public.draft_document (
    id bigserial NOT NULL PRIMARY KEY,
    ...
    document jsonb NOT NULL,
    ein_search character varying(11) NOT NULL
);

CREATE INDEX IF NOT EXISTS count_draft_document_idx ON public.draft_document USING btree (ein_search);
CREATE INDEX IF NOT EXISTS read_draft_document_idx ON public.draft_document USING btree (id, ein_search);
The JSON structure of the document column may vary. Below is one example of a possible schema for document:
"withholdingCredit": {
"type": "array",
"items": {
"$ref": "#/definitions/withholding"
}
}
where the withholding structure (the array elements) conforms to:
"withholding": {
"properties": {
...
"proportionalityIndicator": {
"type": "boolean"
},
"tribute": {
"$ref": "#/definitions/tribute"
},
"payingSourceEin": {
"type": "string"
},
"value": {
"type": "number"
}
...
}
...
},
"tribute": {
"type": "object",
"properties": {
"code": {
"type": "number"
},
"additionalCode": {
"type": "number"
}
...
}
}
Here is an example of the JSON stored in the document jsonb column:
{
  "withholdingCredit": [
    {
      "value": 15000,
      "tribute": {
        "code": 1216,
        "additionalCode": 2
      },
      "payingSourceEin": "03985506123132",
      "proportionalityIndicator": false
    },
    ...
    {
      "value": 98150,
      "tribute": {
        "code": 3155,
        "additionalCode": 1
      },
      "payingSourceEin": "04185506123163",
      "proportionalityIndicator": false
    }
  ]
}
The number of elements in the array varies, up to a business-defined limit of 100,000 (one hundred thousand) elements.
We need a paged SELECT query that returns the withholding array disaggregated (one element per row), where each row also carries the sum of the withholding values and the array length.
The query also needs to return the withholdings ordered by proportionalityIndicator, tribute->code, tribute->additionalCode, and payingSourceEin. Something like:
| id | sum | jsonb_array_length | jsonb_array_elements |
|---|---|---|---|
| 30900 | 1,800,027 | 2300 | {"value":15000,"tribute":{"code":1216,...}, ...} |
| ... | ... | ... | { ... } |
| 30900 | 1,800,027 | 2300 | {"value":98150,"tribute":{"code":3155,...}, ...} |
We have defined the following query:
SELECT dft.id,
       -- grand total of the withholding values for this document
       SUM((elem->>'value')::NUMERIC),
       -- total number of elements in the withholdingCredit array
       jsonb_array_length(dft.document->'withholdingCredit'),
       -- aggregate the elements in the desired order, then re-expand them one per row
       jsonb_array_elements(jsonb_agg(elem
           ORDER BY
               elem->>'proportionalityIndicator',
               (elem->'tribute'->>'code')::NUMERIC,
               (elem->'tribute'->>'additionalCode')::NUMERIC,
               elem->>'payingSourceEin'))
FROM draft_document dft
CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
  AND dft.id = :id
  AND dft.ein_search = :ein_search
GROUP BY dft.id
LIMIT :limit OFFSET :offset;
This query works, but its performance degrades when the jsonb array contains a large number of elements. Any suggestion on how to improve it is welcome.
BTW, we are using Postgres 9.6.
The data has to stay in the document jsonb column due to the need for its structure to be flexible.
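For reference, one direction we are considering is to skip the jsonb_agg / jsonb_array_elements round trip and paginate the expanded rows directly, computing the sum and the count with window functions. This is only an untested sketch (it assumes that window functions, which Postgres evaluates before LIMIT/OFFSET is applied, can stand in for the aggregation; :id, :ein_search, :limit and :offset are the same parameters as above):

SELECT dft.id,
       -- window functions run over all expanded rows, before LIMIT/OFFSET
       SUM((w.elem->>'value')::NUMERIC) OVER () AS sum,
       COUNT(*) OVER () AS array_length,
       w.elem
FROM draft_document dft
CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') AS w(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
  AND dft.id = :id
  AND dft.ein_search = :ein_search
ORDER BY w.elem->>'proportionalityIndicator',
         (w.elem->'tribute'->>'code')::NUMERIC,
         (w.elem->'tribute'->>'additionalCode')::NUMERIC,
         w.elem->>'payingSourceEin'
LIMIT :limit OFFSET :offset;

This would avoid materializing the sorted array a second time, but each page request still expands and sorts the full element set, so we are unsure how much it would gain in practice.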