If I write a query like this:

with WordBreakDown (idx, word, wordlength) as (
    select 
        row_number() over () as idx,
        word,
        character_length(word) as wordlength
    from
    unnest(string_to_array('yo momma so fat', ' ')) as word
)
select 
    cast(wbd.idx + (
        select SUM(wbd2.wordlength)
        from WordBreakDown wbd2
        where wbd2.idx <= wbd.idx
        ) - wbd.wordlength as integer) as position,
    cast(wbd.word as character varying(512)) as part
from
    WordBreakDown wbd;  

... I get a table of 4 rows like so:

1;"yo"
4;"momma"
10;"so"
13;"fat"

... this is what I want. HOWEVER, if I wrap this into a function like so:

drop type if exists split_result cascade;
create type split_result as(
    position integer,
    part character varying(512)
);

drop function if exists split(character varying(512), character(1));    
create function split(
    _s character varying(512), 
    _sep character(1)
    ) returns setof split_result as $$
begin

    return query
    with WordBreakDown (idx, word, wordlength) as (
        select 
            row_number() over () as idx,
            word,
            character_length(word) as wordlength
        from
        unnest(string_to_array(_s, _sep)) as word
    )
    select 
        cast(wbd.idx + (
            select SUM(wbd2.wordlength)
            from WordBreakDown wbd2
            where wbd2.idx <= wbd.idx
            ) - wbd.wordlength as integer) as position,
        cast(wbd.word as character varying(512)) as part
    from
        WordBreakDown wbd;  

end;
$$ language plpgsql;

select * from split('yo momma so fat', ' ');

... I get:

1;"yo momma so fat"

I'm scratching my head on this. What am I screwing up?

UPDATE: Per the suggestions below, I have replaced the function with this:

CREATE OR REPLACE FUNCTION split(_string character varying(512), _sep character(1))
  RETURNS TABLE (postition int, part character varying(512)) AS
$BODY$
BEGIN
    RETURN QUERY
    WITH wbd AS (
        SELECT (row_number() OVER ())::int AS idx
              ,word
              ,length(word) AS wordlength
        FROM   unnest(string_to_array(_string, rpad(_sep, 1))) AS word
        )
    SELECT (sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength
          ,word::character varying(512) -- AS part
    FROM wbd;  
END;
$BODY$ LANGUAGE plpgsql;

... which keeps my original function signature for maximum compatibility and retains the lion's share of the performance gains. Thanks to the answerers; I found this to be a multifaceted learning experience. Your explanations really helped me understand what was going on.

3 Answers

1

Observe this:

select length(' '::character(1));
 length
--------
      0
(1 row)

This confusion stems from the odd definition of the character type in the SQL standard. From the Postgres documentation on character types:

Values of type character are physically padded with spaces to the specified width n, and are stored and displayed that way. However, the padding spaces are treated as semantically insignificant. Trailing spaces are disregarded when comparing two values of type character, and they will be removed when converting a character value to one of the other string types.

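So you should use string_to_array(_s, rpad(_sep, 1)). When _sep holds a space and is converted to text, the space is stripped and string_to_array() receives an empty delimiter, which makes it return the whole string as a single element. rpad() pads the value back to one character.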
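
A minimal sketch of what happens to the separator, assuming a reasonably recent PostgreSQL; the literals are for illustration only and the comments show the results I would expect:

```sql
-- Converting character(1) to text drops the trailing (here: only) space.
SELECT length(' '::character(1)::text);
-- 0

-- With an empty delimiter the input comes back as a single element.
SELECT string_to_array('yo momma so fat', ' '::character(1));
-- {"yo momma so fat"}

-- rpad() pads the now-empty text back to one space, so the split works.
SELECT string_to_array('yo momma so fat', rpad(' '::character(1), 1));
-- {yo,momma,so,fat}
```

For a separator that is not a space, rpad(_sep, 1) leaves the value unchanged, so the fix is safe for other characters too.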

4 Comments

Actually, he should use regexp_split_to_table() to begin with.
@ErwinBrandstetter: No, actually not. Try this: create temporary table t as select string_agg(t::text,' ') as t from (select generate_series(1,1000000) as t) as _;. Then select count(*) from (select unnest(string_to_array(t,' ')) from t) as _; finishes in 0.2 s, but select count(*) from (select regexp_split_to_table(t,' ') from t) as _; ran so long that I cancelled it after several minutes.
The original function is specified with varchar(512). Clearly, @Jeremy does not intend to parse millions of characters. Your test is interesting from another perspective but misses the point here. For small strings, regexp_split_to_table() is simpler but still a bit slower, so you've got a point there.
I'm marking this as the answer as it most closely addressed the question asked, but the other answers were very enlightening as well.
1

You had several constructs that probably did not do what you thought they would.

Here is a much simplified version of your function that is also quite a bit faster:

CREATE OR REPLACE FUNCTION split(_string text, _sep text)
  RETURNS TABLE (postition int, part text) AS
$BODY$
BEGIN
    RETURN QUERY
    WITH wbd AS (
        SELECT (row_number() OVER ())::int AS idx
              ,word
              ,length(word) AS wordlength
        FROM   unnest(string_to_array(_string, _sep)) AS word
        )
    SELECT (sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength
          ,word -- AS part
    FROM wbd;  
END;
$BODY$ LANGUAGE plpgsql;

Explanation

  • Use another window function to sum up the word lengths. Faster, simpler and cleaner. This accounts for most of the performance gain: the correlated subquery in your version is evaluated once per row, which slows the query down.

  • Use the data type text instead of character varying or even character(). character varying and character are awful types, mostly just there for compatibility with the SQL standard and historical reasons. There is hardly anything you can do with those that could not better be done with text. In the meantime @Tometzky has explained why character(1) was a particularly bad choice for the parameter type. I fixed that by using text instead.

  • As @Tometzky demonstrated, unnest(string_to_array(..)) is faster than regexp_split_to_table(..) - even if just a tiny bit for small strings like we use here (max. 512 characters). So I switched back to your original expression.

  • length() does the same as character_length().

  • In a query with only one table source (and no other possible naming conflicts) you might as well not table-qualify column names. Simplifies the code.

  • We need an integer value in the end, so I cast all numerical values (bigint in this case) to integer right away, so additions and subtractions are done with integer arithmetic which is generally fastest.
    'value'::int is just shorter syntax for cast('value' as integer) and otherwise equivalent.
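
To make the running-sum idea concrete, here is a standalone sketch of the query inside the function, with the intermediate values exposed. The running_length column is added purely for illustration and is not part of the function:

```sql
WITH wbd AS (
    SELECT (row_number() OVER ())::int AS idx
          ,word
          ,length(word) AS wordlength
    FROM   unnest(string_to_array('yo momma so fat', ' ')) AS word
    )
SELECT idx
      ,word
      ,wordlength
       -- running total of word lengths up to and including this row
      ,(sum(wordlength) OVER (ORDER BY idx))::int AS running_length
       -- start position: running total, plus one separator per earlier word,
       -- plus 1 for 1-based indexing, minus the current word's own length
      ,(sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength AS pos
FROM   wbd
ORDER  BY idx;

--  idx | word  | wordlength | running_length | pos
-- -----+-------+------------+----------------+-----
--    1 | yo    |          2 |              2 |   1
--    2 | momma |          5 |              7 |   4
--    3 | so    |          2 |              9 |  10
--    4 | fat   |          3 |             12 |  13
```

The window sum is computed in a single pass over the CTE, whereas the original subquery re-reads WordBreakDown for every output row.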

5 Comments

It would break when _sep is '.' or '|' or '[' or any other character that is special in regular expressions. He will not always split on a space; otherwise he wouldn't need a parameter for it. And regexp_split_to_table() is slower than unnest(string_to_array()).
@Tometzky: You would escape special characters, obviously, should you want to use them. But you are right about performance. unnest(string_to_array()) is faster. The difference is minimal with short strings, but performance of regexp_split_to_table() degrades with very long strings. I amended my function accordingly.
What is the significance of the "x" in x.split? As for the character(1)... not my choice. I am trying to write a comparable PostgreSQL schema for a dbFactory pattern that is expecting that data type... the MSSQL and MySQL versions both use CHAR(1).
@JeremyHolovacs: Ah, the x. is a leftover from my testbed. I removed it now. I have a schema x where I test the stuff before I post it. There is also the non-standard data type "char" for a single byte character - may be of interest to you. But "char" can only hold basic ascii characters (single-byte!).
+1 for the code and explanation. With some tweaks to maintain the function signature compatibility, I ended up implementing something very similar to this (see my updated question). It's great when I learn more than one thing from one question. Thanks!
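
As a rough illustration of the escaping point raised in these comments (the input string is made up for the example), a separator that is special in regular expressions has to be escaped for regexp_split_to_table(), while string_to_array() always treats it as plain text:

```sql
-- Unescaped, '.' is a regex wildcard, so every character of the input
-- is treated as a separator and the letters themselves are consumed.
SELECT regexp_split_to_table('a.b.c', '.');

-- Escaped, the dot is taken literally and the string splits into a, b, c.
SELECT regexp_split_to_table('a.b.c', E'\\.');

-- string_to_array() does no pattern matching, so no escaping is needed.
SELECT unnest(string_to_array('a.b.c', '.'));
```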
0

I found the answer, but I don't understand it.

The string_to_array(_s, _sep) function does not split with a non-varying character; even if I wrote it like so it would not work:

string_to_array(_s, cast(_sep as character varying(1)))

BUT if I redefined the parameters as such:

drop function if exists split(character varying(512), character(1));    
create function split(
    _s character varying(512), 
    _sep character varying(1)

... all of a sudden it works as I expected. Dunno what to make of this, and really not the answer I wanted... now I have changed the signature of the function, which is not what I wanted to do.

2 Comments

"The string_to_array(_s, _sep) function does not split with a non-varying character" - sorry, don't believe you. CREATE FUNCTION f(varchar(512), char(1)) RETURNS text[] AS $$ SELECT string_to_array($1,$2) $$ LANGUAGE SQL; => SELECT f('a;b;c;d', ';'); f ----------- {a,b,c,d}
@RichardHuxton: it does not split if the character you need to split on is a space - Jeremy isn't lying.
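
The two comments above can be reconciled with a short sketch, assuming literal inputs in place of the function parameters; the comments show the results I would expect:

```sql
-- A non-space separator survives the conversion from character(1) to text.
SELECT string_to_array('a;b;c;d', ';'::character(1));
-- {a,b,c,d}

-- A space does not: it is stripped, leaving an empty delimiter.
SELECT string_to_array('a b c d', ' '::character(1));
-- {"a b c d"}

-- Casting to character varying(1) strips the space in the same way,
-- which is why the explicit cast in the answer did not help.
SELECT length(cast(' '::character(1) AS character varying(1)));
-- 0
```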
