37

I have a table with a column po_number of type varchar in Postgres 8.4. It stores alphanumeric values with some special characters. I want to ignore the characters [/alpha/?/$/encoding/.] and check if the column contains a number or not. If its a number then it needs to typecast as number or else pass null, as my output field po_number_new is a number field.

Below is the example:

example

SQL Fiddle.

I tired this statement:

select 
(case when  regexp_replace(po_number,'[^\w],.-+\?/','') then po_number::numeric
else null
end) as po_number_new from test

But I got an error for explicit cast:

error

2
  • Undelete your new question, you will get an answer. Just explain all edge cases carefully, that's all. Commented Nov 14, 2016 at 9:39
  • It's ok Tim. I will post a proper question from now-on without any edits : ) no issues. By the way thanks : ) Commented Nov 14, 2016 at 9:46

5 Answers 5

85

Simply:

SELECT NULLIF(regexp_replace(po_number, '\D','','g'), '')::numeric AS result
FROM   tbl;

\D being the class shorthand for "not a digit".
And you need the 4th parameter 'g' (for "globally") to replace all occurrences.
Details in the manual.

For a known, limited set of characters to replace, plain string manipulation functions like replace() or translate() are substantially cheaper. Regular expressions are just more versatile, and we want to eliminate everything but digits in this case. Related:

But why Postgres 8.4? Consider upgrading to a modern version.

Consider pitfalls for outdated versions:

Sign up to request clarification or add additional context in comments.

2 Comments

You might be interested in my proposed solution to this question - using the TRANSLATE function in this particularly easy case instead of the more expensive REGEXP_REPLACE? I'd be interested in, and grateful for, any comment(s) you may have on what I wrote! What led me here was this question - a lot tricker! :-) I have a working solution for PostgreSQL and a guy I was chatting with came up with a SQL Server one - trying to do it without regexes tricky though!
Just a note, I had floating point numbers and this solution converted foobar 40.0 into 400. Using .\D instead of just \D worked as desired in my case (obviously not working if you have other . in your strings)
3

I think you want something like this:

select (case when regexp_replace(po_number, '[^\w],.-+\?/', '') ~ '^[0-9]+$'
             then regexp_replace(po_number, '[^\w],.-+\?/', '')::numeric
        end) as po_number_new 
from test;

That is, you need to do the conversion on the string after replacement.

Note: This assumes that the "number" is just a string of digits.

3 Comments

Could you explain the regexp ,.-+? What it means?
@Abelisto I think they should be inside the brackets, though I'm not in front of Postgre right now to test it.
I am getting only nulls after using the above... instead I need numbers from each record and nulls for records that doesn't have numbers in it
3

If you want to extract floating numbers try to use this:

SELECT NULLIF(regexp_replace(po_number, '[^\.\d]','','g'), '')::numeric AS result FROM tbl;

It's the same as Erwin Brandstetter answer but with different expression:

[^...] - match any character except a list of excluded characters, put the excluded charaters instead of ...

\. - point character (also you can change it to , char)

\d - digit character

Comments

2

The logic I would use to determine if the po_number field contains numeric digits is that its length should decrease when attempting to remove numeric digits.

If so, then all non numeric digits ([^\d]) should be removed from the po_number column. Otherwise, NULL should be returned.

select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number)
            then regexp_replace(po_number, '[^0-9]', '', 'g')
            else null
       end as po_number_new
from test

6 Comments

I am getting an SQL error [42883] after trying this
@user1538020 The error was being caused by you using Postgres 8.x, which does not have a length function. I updated to use char_length and it should work now.
I did a select version(); I am now using PostgreSQL 9.5.2, compiled by Visual C++ build 1800, 64-bit. Still getting an error. I posted the image error.
@user1538020: Asides: The error you get is due to a typo: regex_replace <> regexp_replace. And there is a length() function in pg 8.4.
@Tim Biegeleisen : select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number) then regexp_replace(po_number, '[^0-9]', '', 'g') else null end as po_number_new from test; Corrected the syntax, it now works. thanks..
|
2

Since version 12 - that's 2 years + 4 months ago at the time of writing (but after the last edit that I can see on the accepted answer), you could use a GENERATED FIELD to do this quite easily on a one-time basis rather than having to calculate it each time you wish to SELECT a new po_number.

Furthermore, you can use the TRANSLATE function to extract your digits which is less expensive than the REGEXP_REPLACE solution proposed by @ErwinBrandstetter!

I would do this as follows (all of the code below is available on the fiddle here):

CREATE TABLE s
(
  num TEXT,
  
  new_num INTEGER GENERATED ALWAYS AS
    (NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER) STORED
);

You can add to the 'ABCDEFG... string in the TRANSLATE function as appropriate - I have decimal point (.) and a space ( ) at the end - you may wish to have more characters there depending on your input!

And checking:

INSERT INTO s VALUES ('2'), (''), (NULL), (' ');
INSERT INTO t VALUES ('2'), (''), (NULL), (' ');
SELECT * FROM s;
SELECT * FROM t;

Result (same for both):

num    new_num
  2          2
          NULL
          NULL
          NULL

So, I wanted to check how efficient my solution was, so I ran the following test inserting 10,000 records into both tables s and t as follows (from here):

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
INSERT INTO t 
with symbols(characters) as 
(
  VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
)
select string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '')
from symbols
join generate_series(1,10) as word(chr_idx) on 1 = 1 -- word length
join generate_series(1,10000) as words(idx) on 1 = 1 -- # of words
group by idx;

The differences weren't that huge but the regex solution was consistently slower by about 25% - even changing the order of the tables undergoing the INSERTs.

However, where the TRANSLATE solution really shines is when doing a "raw" SELECT as follows:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
  NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER
FROM s;

and the same for the REGEXP_REPLACE solution.

The differences were very marked, the TRANSLATE taking approx. 25% of the time of the other function. Finally, in the interests of fairness, I also did this for both tables:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
  num, new_num
FROM t;

Both extremely quick and identical!

1 Comment

Yes, regular expression functions are known to be substantially more expensive (got faster in recent versions, but still). I've kept pointing that out over the last decade: stackoverflow.com/a/28172693/939860, stackoverflow.com/a/34106732/939860, ... However, listing all possible non-digits was somewhat more practical in times of ASCII, not so much with UTF ...

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.