Extracting values from non-standard markup strings in PostgreSQL

Question

Unfortunately, I have a table like the following:

DROP TABLE IF EXISTS my_list;
CREATE TABLE my_list (index int PRIMARY KEY, mystring text, status text);

INSERT INTO my_list    
(index, mystring,                                           status) VALUES 
   (12, '',                                                    'D'), 
   (14, '[id] 5',                                              'A'), 
   (15, '[id] 12[num] 03952145815',                            'C'), 
   (16, '[id] 314[num] 03952145815[name] Sweet',               'E'), 
   (19, '[id] 01211[num] 03952145815[name] Home[oth] Alabama', 'B');

Is there any trick to extract the number that follows [id] as an integer from the mystring text shown above? It should work as though I ran the following query:

SELECT index, extract_id_function(mystring), status FROM my_list;

and got results like:

12  0     D  
14  5     A 
15  12    C 
16  314   E 
19  1211  B

Preferably using only simple string functions; if that is not possible, a regular expression will be fine.

Always specify your PostgreSQL version in questions. (Please edit, and comment here when done). Then please find whoever designed that schema and say mean things to them ;-) . I'll give an answer a go anyway.
– Craig Ringer, Commented Jan 6, 2014 at 12:33
Also, why do you wish to avoid regex? Sometimes they're the right tool for the job, especially given how painful string manipulation in SQL can be because of the inability to easily refer to values elsewhere at the same query level.
– Craig Ringer, Commented Jan 6, 2014 at 13:04
My actual version is 9.1 on Windows 7. I ran some tests with regex in queries and found that regex has problems with the Unicode letters that are quite common in my language, so I can't use it reliably. As you might guess, I designed that schema myself, and I am ready to say those things to myself :) I would certainly never do that today. For programming I use .NET, where such expressions are not a problem, but I didn't give enough thought to PostgreSQL.
– Wine Too, Commented Jan 6, 2014 at 15:39
Update: see stackoverflow.com/a/14293924/398670. Regular expression support for Unicode and UTF-8 was significantly improved in 9.2. (That's why it's best to explain the why in your questions, not just the what, and to include sample data that accurately reflects the real problem.)
– Craig Ringer, Commented Jan 7, 2014 at 0:01

Community · Accepted Answer · 2017-05-23 12:11:36Z

2

If I understand correctly, you have a rather unconventional markup format where [id] is followed by a space, then a series of digits that represents a numeric identifier. There is no closing tag; the first non-numeric character after the digits ends the ID.

If so, you're going to be able to do this with non-regexp string ops, but only quite badly. What you'd really need is the SQL equivalent of strtol, which consumes input up to the first non-digit and returns just the number it has read so far. A cast to integer will not do that; it'll report an error if it sees non-numeric garbage after the number. (As it happens I just wrote a C extension that exposes strtol for decoding hex values, but I'm guessing you don't want to use C extensions if you don't even want regex...)
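
To illustrate the point about casts, here is a small sketch using literal strings taken from the sample data. It only shows the behaviour described above and is not part of the solution; the exact wording of the error message varies between PostgreSQL versions.

```sql
-- A string made up only of digits casts cleanly; leading zeros are dropped.
SELECT '01211'::integer;

-- Trailing non-numeric text is not ignored: unlike strtol, the cast does not
-- stop at the first non-digit, it fails with an "invalid input syntax" error.
SELECT '12[num] 03952145815'::integer;
```

This is why the number has to be cut out of the string before it can be cast.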

It can be done with string ops if you make the simplifying assumption that an [id] nnnn tag always ends with either end of string or another tag, so it's always [ at the end of the number. We also assume that you're only interested in the first [id] if multiple appear in a string. That way you can write something like the following horrible monstrosity:

select
  "index",
  case 
    when next_tag_idx > 0 then substring(cut_id from 1 for next_tag_idx - 1) 
    else cut_id 
  end AS "my_id",
  "status"
from (
  select 
    position('[' in cut_id) AS next_tag_idx,
    *
  from (
    select 
      case 
        when id_offset = 0 then null 
        else substring(mystring from id_offset + 5)  -- skip all 5 characters of '[id] ', including the space
      end AS cut_id,
      *
    from (
      select
        position('[id] ' in mystring) AS id_offset,
        *
      from my_list
    ) x
  ) y
) z;

(If anybody ever actually uses that query for anything, kittens will fall from the sky and splat upon the pavement, wailing in horror all the way down).
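
For comparison, here is a more compact string-only sketch under the same simplifying assumptions (the ID is followed by either end of string or a `[`, and only the first [id] matters). It is not the query above, just an alternative built on `split_part`, with `nullif` and `coalesce` added as my own assumption about how to get the 0 the question asks for when no [id] is present.

```sql
SELECT
  "index",
  coalesce(
    -- Take the text after the first '[id] ', then keep what precedes the next '['.
    -- An empty result (no [id] tag found) becomes NULL, and then 0.
    nullif(split_part(split_part(mystring, '[id] ', 2), '[', 1), '')::integer,
    0
  ) AS my_id,
  status
FROM my_list;
```

Like the nested version, this still fails with a cast error if anything other than digits sits between [id] and the next tag.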

Or you can be sensible and just use a regular expression for this kind of string processing, in which case your query (assuming you only want the first [id]) is:

regress=> SELECT
            "index", 
            coalesce((SELECT (regexp_matches(mystring, '\[id\]\s?(\d+)'))[1])::integer, 0) AS my_id,
            status 
          FROM my_list;
 index | my_id | status 
-------+-------+--------
    12 |     0 | D
    14 |     5 | A
    15 |    12 | C
    16 |   314 | E
    19 |  1211 | B
(5 rows)
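
If you want to call it the way the question does, the same expression could be wrapped in a SQL function. This is only a sketch: the name `extract_id_function` comes from the question, the `IMMUTABLE` marking is my assumption, and the parameter is referenced as `$1` because SQL functions on 9.1 cannot refer to parameters by name.

```sql
CREATE OR REPLACE FUNCTION extract_id_function(text)
RETURNS integer
LANGUAGE sql
IMMUTABLE
AS $$
  -- First capture group of the first match, cast to integer; 0 when there is no [id].
  SELECT coalesce(
    (SELECT (regexp_matches($1, '\[id\]\s?(\d+)'))[1])::integer,
    0
  );
$$;

-- Usage, as written in the question:
SELECT "index", extract_id_function(mystring), status FROM my_list;
```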

Update: If you're having issues with Unicode handling in regex, upgrade to PostgreSQL 9.2 or later. See https://stackoverflow.com/a/14293924/398670

edited May 23, 2017 at 12:11

answered Jan 6, 2014 at 12:46

Craig Ringer

3 Comments

Wine Too Over a year ago

Hello Craig, and thank you for the thorough explanation with examples. This is really the point where I should think about writing a script to convert the old data, since both solutions are complicated and time-consuming. Anyway, the regex version looks more acceptable and less error-prone. Can your expression be upgraded to return 0 for index 12 (the first row)?

Craig Ringer Over a year ago

@user973238 Sure. That's a simple coalesce. And yeah, I'd recommend splitting the data up in your schema so you don't have to do this kind of processing all the time. If you're attempting to store key/value data (tags, etc) where there's no fixed list of property names you can use as columns, look into hstore, or consider storing json fields. Or you can fall back on EAV if you're stuck.

Wine Too Over a year ago

hstore is a very interesting thing, I didn't know about it. I thought about storing such things as XML, but I see there are built-in methods for this. Thanks, Craig.
