
I encounter a strange problem with regular expression tokenization and Unicode strings.

> import re
> mystring = "Unicode rägular expressions"
> tokens = re.findall(r'\w+', mystring, re.UNICODE)

This is what I get:

> print tokens
['Unicode', 'r\xc3', 'gular', 'expressions']

This is what I expected:

> print tokens
['Unicode', 'rägular', 'expressions']

What do I have to do to get the expected result?

Update: This question is different from mine: matching unicode characters in python regular expressions. But its answer https://stackoverflow.com/a/5028826/1251687 would have solved my problem, too.

  • \w does not include Unicode characters like ä. Commented Apr 18, 2015 at 17:39
  • What's the way to do it then? Commented Apr 18, 2015 at 17:40
  • \w includes Unicode if you use re.UNICODE. Commented Apr 18, 2015 at 17:41
  • @Xufox: it does when you use the re.UNICODE flag. Commented Apr 18, 2015 at 17:42
  • At issue here is that you are trying to match encoded bytes, not Unicode codepoints. Commented Apr 18, 2015 at 17:42

2 Answers


The string must be a unicode object, not an encoded byte string.

mystring = u"Unicode rägular expressions"
tokens = re.findall(r'\w+', mystring, re.UNICODE)
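For context (an assumption beyond this Python 2 answer): in Python 3 every str is already Unicode, so the same pattern works without the u prefix or the re.UNICODE flag. A minimal sketch:

```python
import re

# Python 3: str literals are Unicode, and \w matches Unicode
# word characters by default, so "ä" stays inside the token.
mystring = "Unicode rägular expressions"
tokens = re.findall(r'\w+', mystring)
print(tokens)  # ['Unicode', 'rägular', 'expressions']
```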

2 Comments

That's it. Python Unicode is a massive headache...
@boadescriptor: Watch nedbatchelder.com/text/unipain.html to reduce that headache. Stick to the Unicode sandwich, avoid handling encoded bytes as much as you can.

You have UTF-8-encoded bytes, not Unicode text: the ä shows up as the two bytes \xc3\xa4 in your output. Decode your input first:

tokens = re.findall(r'\w+', mystring.decode('utf-8'), re.UNICODE)

An encoded byte can mean anything depending on the codec used; it is not a specific Unicode codepoint. For byte strings (type str), \w matches only ASCII word characters by default.
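To see the round trip concretely, here is a sketch in Python 3 (where bytes and str are distinct types). The garbled token 'r\xc3' in the question arises because ä was stored as the two UTF-8 bytes 0xC3 0xA4; decoding first gives \w real codepoints to match:

```python
import re

# ä encoded as UTF-8 occupies two bytes, 0xC3 0xA4.
raw = b"Unicode r\xc3\xa4gular expressions"

text = raw.decode('utf-8')         # bytes -> Unicode str
tokens = re.findall(r'\w+', text)  # \w now sees whole codepoints
print(tokens)  # ['Unicode', 'rägular', 'expressions']
```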

