The title sums it up pretty well. I'm looking for a regular expression that matches a Unicode uppercase character with the Postgres ~ operator. The obvious way doesn't work:

=> select 'A' ~ '[[:upper:]]';
 ?column? 
----------
 t
(1 row)

=> select 'Ó' ~ '[[:upper:]]';
 ?column? 
----------
 t
(1 row)

=> select 'Ą' ~ '[[:upper:]]';
 ?column? 
----------
 f
(1 row)

I'm using PostgreSQL 9.1 and my locale is set to pl_PL.UTF-8. Ordering works fine.

=> show LC_CTYPE;
  lc_ctype   
-------------
 pl_PL.UTF-8
(1 row)
  • Not a proper answer, but Ą matches [[:upper:]] on my local PostgreSQL 9.2.1 (but not 9.1.6). Commented Jan 11, 2013 at 12:27
  • @araqnid Matches for me also in 9.2. What if you try with the collation in 9.1?: select 'Ą' ~ '[[:upper:]]' collate "pl_PL" Commented Jan 11, 2013 at 12:31
  • @Clodoaldo explicitly specifying the collation makes no difference to the results Commented Jan 11, 2013 at 12:38
  • So it looks like a bug which has been finally fixed in 9.2? Commented Jan 11, 2013 at 12:53
  • May be related to bug #6457, which got fixed when it was reported (but is not mentioned in the release notes as far as I can see). Commented Jan 11, 2013 at 14:22

2 Answers

The regexp engine of PG 9.1 and older versions does not correctly classify characters whose code point doesn't fit in one byte, that is, code points above 255. The code point of 'Ó' is 211, so it is classified correctly, but the code point of 'Ą' is 260, beyond 255.
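
The code points can be checked from psql. This is only a sketch and assumes the server encoding is UTF8, where ascii() returns the Unicode code point of the first character; the column aliases are made up for illustration:

```sql
-- ascii() returns the Unicode code point of the first character
-- when the server encoding is UTF8.
select ascii('Ó') as o_acute,   -- 211, fits under the 255 limit
       ascii('Ą') as a_ogonek;  -- 260, above the 255 limit
```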

PG 9.2 is better at this, though still not 100% right for all alphabets. See this commit in the PostgreSQL source code, and particularly these parts of the comment:

remove the hard-wired limitation to not consider wctype.h results for character codes above 255

and

Still, we can push it up to U+7FF (which I chose as the limit of 2-byte UTF8 characters), which will at least make Eastern Europeans happy pending a better solution

Unfortunately, this was not backported to 9.1.


I've found that Perl regular expressions handle Unicode perfectly.

create extension plperl;

create function is_letter_upper(text) returns boolean
immutable strict language plperl
as $$
    use feature 'unicode_strings';
    return $_[0] =~ /^\p{IsUpper}$/ ? "true" : "false";
$$;

Tested on PostgreSQL 9.2 with Perl 5.16.2.
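
A possible way to call it, assuming the function above was created in a UTF8 database. Results are not shown here because they depend on the server and Perl versions:

```sql
-- The pattern is anchored with ^ and $, so pass exactly one character.
select is_letter_upper('Ą');
select is_letter_upper('ą');
```

Because of the anchors, a longer string never matches; to test whether a string contains any uppercase letter, the anchors would have to be dropped from the pattern.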
