Consider a custom aggregate intended to take the set union of a bunch of arrays:

CREATE FUNCTION array_union_step (s ANYARRAY, n ANYARRAY) RETURNS ANYARRAY
   AS $$ SELECT s || n; $$
   LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;

CREATE FUNCTION array_union_final (s ANYARRAY) RETURNS ANYARRAY
  AS $$
    SELECT array_agg(i ORDER BY i) FROM (
      SELECT DISTINCT UNNEST(x) AS i FROM (VALUES(s)) AS v(x)
    ) AS w WHERE i IS NOT NULL;
  $$
  LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;

CREATE AGGREGATE array_union (ANYARRAY) (
  SFUNC = array_union_step,
  STYPE = ANYARRAY,
  FINALFUNC = array_union_final,
  INITCOND = '{}',
  PARALLEL = SAFE
);

As I understand it, array concatenation in PostgreSQL copies all the elements of both inputs into a new array, so this is quadratic in the total number of elements (before deduplication). Is there a more efficient alternative without writing extension code in C? (Specifically, using either LANGUAGE SQL or LANGUAGE plpgsql.) For instance, maybe it's possible for the step function to take and return a set of rows somehow?


An example of the kind of data this needs to be able to process:

create temp table demo (tag int, vals text[]);  -- "values" is a reserved word and cannot be an unquoted column name
insert into demo values
   (1, '{"a", "b"}'),
   (2, '{"c", "d"}'),
   (1, '{"a"}'),
   (2, '{"c", "e", "f"}');

select tag, array_union(vals) from demo group by tag;
 tag | array_union 
-----+-------------
   2 | {c,d,e,f}
   1 | {a,b}

Note in particular that the built-in array_agg cannot be used with this data, because the arrays are not all the same length:

select tag, array_agg(vals) from demo group by tag;
ERROR:  cannot accumulate arrays of different dimensionality

1 Answer

Array concatenation is expensive. That's why the built-in array_agg() uses an internal array-builder structure. Unfortunately, you cannot use this API at the SQL level.

I don't think using temp tables for custom aggregation is a good idea. Creating and dropping tables is expensive (temp tables too), and very expensive at high frequency; INSERT and SELECT are expensive as well. (Tables have a much more complex format than arrays.) If you really need fast aggregation, then write a C function and use the array builder.

If you cannot use your own C extension, then use the built-in array_agg() function and deduplicate and sort the already-aggregated data:

CREATE OR REPLACE FUNCTION array_distinct(anyarray)
RETURNS anyarray AS $$
BEGIN
  RETURN ARRAY(SELECT DISTINCT v FROM unnest($1) u(v) WHERE v IS NOT NULL ORDER BY v);
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
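If a plain SQL function is preferred, as one of the comments below suggests, an equivalent sketch might look like the following. This is my variant, not the answerer's: it assumes the same semantics as the plpgsql version, and PARALLEL SAFE requires PostgreSQL 9.6 or later.

```sql
-- LANGUAGE sql variant of array_distinct. A single-query SQL body can
-- often be inlined by the planner; PARALLEL SAFE is appropriate because
-- the function only reads its argument.
CREATE OR REPLACE FUNCTION array_distinct(anyarray)
RETURNS anyarray AS $$
  SELECT ARRAY(
    SELECT DISTINCT v
    FROM unnest($1) AS u(v)
    WHERE v IS NOT NULL
    ORDER BY v
  );
$$ LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE;
```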

Call:

SELECT ..., array_distinct(array_agg(somecolumn)) FROM tab;

10 Comments

Or just LANGUAGE sql? With PARALLEL SAFE?
I'm curious, is it only array concatenation or also array indexed assignment that is slow? Could this aggregate be improved by an array_builder as well?
Unfortunately, this doesn't work for me, because array_agg(somecolumn) is only valid when somecolumn has a scalar type (anyelement in pg-ese) or when all the arrays in somecolumn have exactly the same dimensions. I have a column of 1D arrays of variable length.
@ErwinBrandstetter - yes, it is possible too, and it is faster
@zwol - array_agg can aggregate arrays too; you just need Postgres 9.5 or higher. But if you cannot use array_agg, then you need to write your own C function.
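One way to apply the answer's approach to the question's variable-length array column, sidestepping the dimensionality error entirely, is to unnest before aggregating. This is my sketch, not from the answer, and it assumes the question's demo table with its array column named vals:

```sql
-- Unnest each array into scalar rows first, then aggregate the scalars.
-- array_agg(DISTINCT ... ORDER BY ...) deduplicates and sorts in one pass,
-- avoiding both the quadratic concatenation and array_agg's
-- equal-dimensions requirement.
SELECT tag, array_agg(DISTINCT v ORDER BY v) AS union_vals
FROM demo, unnest(vals) AS v
WHERE v IS NOT NULL
GROUP BY tag;
```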
