
I encounter a strange problem with regular expression tokenization and Unicode strings.

> import re
> mystring = "Unicode rägular expressions"
> tokens = re.findall(r'\w+', mystring, re.UNICODE)

This is what I get:

> print tokens
['Unicode', 'r\xc3', 'gular', 'expressions']

This is what I expected:

> print tokens
['Unicode', 'rägular', 'expressions']

What do I have to do to get the expected result?

Update: This question is different from mine: matching unicode characters in python regular expressions. But its answer https://stackoverflow.com/a/5028826/1251687 would have solved my problem, too.

  • \w does not include Unicode characters like ä. Commented Apr 18, 2015 at 17:39
  • What's the way to do it then? Commented Apr 18, 2015 at 17:40
  • \w includes Unicode if you use re.UNICODE. Commented Apr 18, 2015 at 17:41
  • @Xufox: it does when you use the re.UNICODE flag. Commented Apr 18, 2015 at 17:42
  • At issue here is that you are trying to match encoded bytes, not Unicode codepoints. Commented Apr 18, 2015 at 17:42

2 Answers


The string must be a unicode object, not an encoded byte string.

mystring = u"Unicode rägular expressions"
tokens = re.findall(r'\w+', mystring, re.UNICODE)
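For context (an assumption beyond this Python 2 answer): in Python 3 every str is already Unicode, so the same pattern works without the u prefix or the re.UNICODE flag. A minimal sketch:

```python
import re

# Python 3: str literals are Unicode, and \w matches Unicode
# word characters by default, so "ä" stays inside the token.
mystring = "Unicode rägular expressions"
tokens = re.findall(r'\w+', mystring)
print(tokens)  # ['Unicode', 'rägular', 'expressions']
```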

2 Comments

That's it. Python Unicode is a massive headache...
@boadescriptor: Watch nedbatchelder.com/text/unipain.html to reduce that headache. Stick to the Unicode sandwich, avoid handling encoded bytes as much as you can.

You have UTF-8-encoded bytes, not Unicode text: the ä shows up as the two bytes \xc3\xa4 in your output. Decode your input first:

tokens = re.findall(r'\w+', mystring.decode('utf-8'), re.UNICODE)

An encoded byte can mean anything depending on the codec used; it is not a specific Unicode codepoint. For byte strings (type str), \w matches only ASCII word characters by default.
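To see the round trip concretely, here is a sketch in Python 3 (where bytes and str are distinct types). The garbled token 'r\xc3' in the question arises because ä was stored as the two UTF-8 bytes 0xC3 0xA4; decoding first gives \w real codepoints to match:

```python
import re

# ä encoded as UTF-8 occupies two bytes, 0xC3 0xA4.
raw = b"Unicode r\xc3\xa4gular expressions"

text = raw.decode('utf-8')         # bytes -> Unicode str
tokens = re.findall(r'\w+', text)  # \w now sees whole codepoints
print(tokens)  # ['Unicode', 'rägular', 'expressions']
```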

