Extract numbers from a field in PostgreSQL

Question

I have a table with a column po_number of type varchar in Postgres 8.4. It stores alphanumeric values with some special characters. I want to ignore the characters [/alpha/?/$/encoding/.] and check if the column contains a number or not. If its a number then it needs to typecast as number or else pass null, as my output field po_number_new is a number field.

Below is the example:

SQL Fiddle.

I tired this statement:

select 
(case when  regexp_replace(po_number,'[^\w],.-+\?/','') then po_number::numeric
else null
end) as po_number_new from test

But I got an error for explicit cast:

Undelete your new question, you will get an answer. Just explain all edge cases carefully, that's all. — Tim Biegeleisen
– Tim Biegeleisen, Commented Nov 14, 2016 at 9:39
It's ok Tim. I will post a proper question from now-on without any edits : ) no issues. By the way thanks : ) — user1538020
– user1538020, Commented Nov 14, 2016 at 9:46

Erwin Brandstetter · Accepted Answer · 2022-02-14 18:02:26Z

85

Simply:

SELECT NULLIF(regexp_replace(po_number, '\D','','g'), '')::numeric AS result
FROM   tbl;

\D being the class shorthand for "not a digit".
And you need the 4th parameter 'g' (for "globally") to replace all occurrences.
Details in the manual.

For a known, limited set of characters to replace, plain string manipulation functions like replace() or translate() are substantially cheaper. Regular expressions are just more versatile, and we want to eliminate everything but digits in this case. Related:

But why Postgres 8.4? Consider upgrading to a modern version.

Consider pitfalls for outdated versions:

edited Feb 14, 2022 at 18:02

answered Nov 12, 2016 at 15:59

Erwin Brandstetter

669k160 gold badges1.2k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Vérace Over a year ago

You might be interested in my proposed solution to this question - using the TRANSLATE function in this particularly easy case instead of the more expensive REGEXP_REPLACE? I'd be interested in, and grateful for, any comment(s) you may have on what I wrote! What led me here was this question - a lot tricker! :-) I have a working solution for PostgreSQL and a guy I was chatting with came up with a SQL Server one - trying to do it without regexes tricky though!

marcosh Over a year ago

Just a note, I had floating point numbers and this solution converted foobar 40.0 into 400. Using .\D instead of just \D worked as desired in my case (obviously not working if you have other . in your strings)

Gordon Linoff · Accepted Answer · 2016-11-12 15:41:35Z

3

I think you want something like this:

select (case when regexp_replace(po_number, '[^\w],.-+\?/', '') ~ '^[0-9]+$'
             then regexp_replace(po_number, '[^\w],.-+\?/', '')::numeric
        end) as po_number_new 
from test;

That is, you need to do the conversion on the string after replacement.

Note: This assumes that the "number" is just a string of digits.

answered Nov 12, 2016 at 15:41

Gordon Linoff

1.3m62 gold badges706 silver badges857 bronze badges

3 Comments

Abelisto Over a year ago

Could you explain the regexp ,.-+? What it means?

Tim Biegeleisen Over a year ago

@Abelisto I think they should be inside the brackets, though I'm not in front of Postgre right now to test it.

user1538020 Over a year ago

I am getting only nulls after using the above... instead I need numbers from each record and nulls for records that doesn't have numbers in it

doninpr · Accepted Answer · 2021-09-30 07:30:03Z

3

If you want to extract floating numbers try to use this:

SELECT NULLIF(regexp_replace(po_number, '[^\.\d]','','g'), '')::numeric AS result FROM tbl;

It's the same as Erwin Brandstetter answer but with different expression:

[^...] - match any character except a list of excluded characters, put the excluded charaters instead of ...

\. - point character (also you can change it to , char)

\d - digit character

answered Sep 30, 2021 at 7:30

doninpr

593 bronze badges

Comments

Tim Biegeleisen · Accepted Answer · 2016-11-13 15:14:40Z

2

The logic I would use to determine if the po_number field contains numeric digits is that its length should decrease when attempting to remove numeric digits.

If so, then all non numeric digits ([^\d]) should be removed from the po_number column. Otherwise, NULL should be returned.

select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number)
            then regexp_replace(po_number, '[^0-9]', '', 'g')
            else null
       end as po_number_new
from test

edited Nov 13, 2016 at 15:14

answered Nov 12, 2016 at 15:44

Tim Biegeleisen

526k32 gold badges324 silver badges399 bronze badges

6 Comments

user1538020 Over a year ago

I am getting an SQL error [42883] after trying this

Tim Biegeleisen Over a year ago

@user1538020 The error was being caused by you using Postgres 8.x, which does not have a length function. I updated to use char_length and it should work now.

user1538020 Over a year ago

I did a select version(); I am now using PostgreSQL 9.5.2, compiled by Visual C++ build 1800, 64-bit. Still getting an error. I posted the image error.

Erwin Brandstetter Over a year ago

@user1538020: Asides: The error you get is due to a typo: regex_replace <> regexp_replace. And there is a length() function in pg 8.4.

user1538020 Over a year ago

@Tim Biegeleisen : select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number) then regexp_replace(po_number, '[^0-9]', '', 'g') else null end as po_number_new from test; Corrected the syntax, it now works. thanks..

|

Vérace · Accepted Answer · 2022-02-09 02:49:01Z

Since version 12 - that's 2 years + 4 months ago at the time of writing (but after the last edit that I can see on the accepted answer), you could use a GENERATED FIELD to do this quite easily on a one-time basis rather than having to calculate it each time you wish to SELECT a new po_number.

Furthermore, you can use the TRANSLATE function to extract your digits which is less expensive than the REGEXP_REPLACE solution proposed by @ErwinBrandstetter!

I would do this as follows (all of the code below is available on the fiddle here):

CREATE TABLE s
(
  num TEXT,
  
  new_num INTEGER GENERATED ALWAYS AS
    (NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER) STORED
);

You can add to the 'ABCDEFG... string in the TRANSLATE function as appropriate - I have decimal point (.) and a space ( ) at the end - you may wish to have more characters there depending on your input!

And checking:

INSERT INTO s VALUES ('2'), (''), (NULL), (' ');
INSERT INTO t VALUES ('2'), (''), (NULL), (' ');
SELECT * FROM s;
SELECT * FROM t;

Result (same for both):

num    new_num
  2          2
          NULL
          NULL
          NULL

So, I wanted to check how efficient my solution was, so I ran the following test inserting 10,000 records into both tables s and t as follows (from here):

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
INSERT INTO t 
with symbols(characters) as 
(
  VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
)
select string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '')
from symbols
join generate_series(1,10) as word(chr_idx) on 1 = 1 -- word length
join generate_series(1,10000) as words(idx) on 1 = 1 -- # of words
group by idx;

The differences weren't that huge but the regex solution was consistently slower by about 25% - even changing the order of the tables undergoing the INSERTs.

However, where the TRANSLATE solution really shines is when doing a "raw" SELECT as follows:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
  NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER
FROM s;

and the same for the REGEXP_REPLACE solution.

The differences were very marked, the TRANSLATE taking approx. 25% of the time of the other function. Finally, in the interests of fairness, I also did this for both tables:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
  num, new_num
FROM t;

Both extremely quick and identical!

Yes, regular expression functions are known to be substantially more expensive (got faster in recent versions, but still). I've kept pointing that out over the last decade: stackoverflow.com/a/28172693/939860, stackoverflow.com/a/34106732/939860, ... However, listing all possible non-digits was somewhat more practical in times of ASCII, not so much with UTF ...

Collectives™ on Stack Overflow

Extract numbers from a field in PostgreSQL

5 Answers 5

2 Comments

3 Comments

Comments

6 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

2 Comments

3 Comments

Comments

6 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related