Consider a custom aggregate intended to take the set union of a bunch of arrays:

CREATE FUNCTION array_union_step (s ANYARRAY, n ANYARRAY) RETURNS ANYARRAY
   AS $$ SELECT s || n; $$
   LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;

CREATE FUNCTION array_union_final (s ANYARRAY) RETURNS ANYARRAY
  AS $$
    SELECT array_agg(i ORDER BY i) FROM (
      SELECT DISTINCT UNNEST(x) AS i FROM (VALUES(s)) AS v(x)
    ) AS w WHERE i IS NOT NULL;
  $$
  LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;

CREATE AGGREGATE array_union (ANYARRAY) (
  SFUNC = array_union_step,
  STYPE = ANYARRAY,
  FINALFUNC = array_union_final,
  INITCOND = '{}',
  PARALLEL = SAFE
);

As I understand it, array concatenation in PostgreSQL copies all the elements of both inputs into a new array, so this is quadratic in the total number of elements (before deduplication). Is there a more efficient alternative without writing extension code in C? (Specifically, using either LANGUAGE SQL or LANGUAGE plpgsql.) For instance, maybe it's possible for the step function to take and return a set of rows somehow?


An example of the kind of data this needs to be able to process:

create temp table demo (tag int, vals text[]);  -- "values" is a reserved word and cannot be an unquoted column name
insert into demo values
   (1, '{"a", "b"}'),
   (2, '{"c", "d"}'),
   (1, '{"a"}'),
   (2, '{"c", "e", "f"}');

select tag, array_union(vals) from demo group by tag;
 tag | array_union 
-----+-------------
   2 | {c,d,e,f}
   1 | {a,b}

Note in particular that the built-in array_agg cannot be used with this data, because the arrays are not all the same length:

select tag, array_agg(vals) from demo group by tag;
ERROR:  cannot accumulate arrays of different dimensionality

1 Answer

Array concatenation is expensive. That's why the built-in array_agg() uses an internal array-builder structure. Unfortunately, you cannot use this API at the SQL level.

I don't think using temp tables for custom aggregation is a good idea. Creating and dropping tables is expensive (temp tables too), and very expensive at high frequency; INSERT and SELECT are expensive as well. (Tables have a much more complex format than arrays.) If you really need fast aggregation, then write a C function and use the array builder.

If you cannot use your own C extension, then use the built-in array_agg() function and deduplicate and sort the already-aggregated data:

CREATE OR REPLACE FUNCTION array_distinct(anyarray)
RETURNS anyarray AS $$
BEGIN
  RETURN ARRAY(SELECT DISTINCT v FROM unnest($1) u(v) WHERE v IS NOT NULL ORDER BY v);
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
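If a plain SQL function is preferred, as one of the comments below suggests, an equivalent sketch might look like the following. This is my variant, not the answerer's: it assumes the same semantics as the plpgsql version, and PARALLEL SAFE requires PostgreSQL 9.6 or later.

```sql
-- LANGUAGE sql variant of array_distinct. A single-query SQL body can
-- often be inlined by the planner; PARALLEL SAFE is appropriate because
-- the function only reads its argument.
CREATE OR REPLACE FUNCTION array_distinct(anyarray)
RETURNS anyarray AS $$
  SELECT ARRAY(
    SELECT DISTINCT v
    FROM unnest($1) AS u(v)
    WHERE v IS NOT NULL
    ORDER BY v
  );
$$ LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE;
```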

Call:

SELECT ..., array_distinct(array_agg(somecolumn)) FROM tab;

10 Comments

Or just LANGUAGE sql? With PARALLEL SAFE?
I'm curious, is it only array concatenation or also array indexed assignment that is slow? Could this aggregate be improved by an array_builder as well?
Unfortunately, this doesn't work for me, because array_agg(somecolumn) is only valid when somecolumn has a scalar type (anyelement in pg-ese) or when all the arrays in somecolumn have exactly the same dimensions. I have a column of 1D arrays of variable length.
@ErwinBrandstetter - yes, it is possible too, and it is faster
@zwol - array_agg can aggregate arrays too; you just need Postgres 9.5 or higher. But if you cannot use array_agg, then you need to write your own C function.
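One way to apply the answer's approach to the question's variable-length array column, sidestepping the dimensionality error entirely, is to unnest before aggregating. This is my sketch, not from the answer, and it assumes the question's demo table with its array column named vals:

```sql
-- Unnest each array into scalar rows first, then aggregate the scalars.
-- array_agg(DISTINCT ... ORDER BY ...) deduplicates and sorts in one pass,
-- avoiding both the quadratic concatenation and array_agg's
-- equal-dimensions requirement.
SELECT tag, array_agg(DISTINCT v ORDER BY v) AS union_vals
FROM demo, unnest(vals) AS v
WHERE v IS NOT NULL
GROUP BY tag;
```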
