
In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their UTF-8 codes. There are two troublesome characters involved: '\xc2\xad' (the UTF-8 encoding of the soft hyphen, U+00AD) and '\x0c' (a form feed). Now I just need to remove these characters, as well as some spaces and the page numbers.

Elsewhere on SO, I've seen unicode characters used in tandem with regexes, but they're written in a format my characters don't come in, e.g. '\u00ab'. In addition, none of those examples mix ASCII and non-ASCII characters. Finally, the Python docs are very light on the subject of unicode in regexes... something about flags... I don't know. Can anyone help?

Here is my current usage, which does not do what I want:

re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)
  • I guess it would be helpful to point you toward Joel and deceze. Commented Sep 25, 2013 at 9:31
  • I've read the Joel article before. So should I infer that the difficulty I'm having is just my confusion about what unicode is? Commented Sep 25, 2013 at 9:33
  • It can be. Could you describe your input more precisely (e.g. what repr(my_str) says)? Commented Sep 25, 2013 at 9:42
  • OK, it appears to be a UTF-8-encoded byte string. So your options are either 1) replace the verbatim bytes in that string, or 2) convert it to unicode and replace characters (both options are sketched just below these comments). Commented Sep 25, 2013 at 9:58
  • Look out for zero-width spaces there! Commented Sep 25, 2013 at 15:05
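
A minimal sketch of both of those options, assuming (as the comments suggest) that my_str is a UTF-8-encoded byte string and that a separator looks roughly like '\xc2\xad 9 \xc2\xad\x0c'. The exact spacing is an assumption, hence the \s* below:

    import re

    # Option 1: replace the verbatim bytes in the UTF-8 byte string.
    # '\xc2\xad' is the UTF-8 encoding of U+00AD (soft hyphen); '\x0c' is a form feed.
    cleaned = re.sub('\xc2\xad\s*\d+\s*\xc2\xad\s*\x0c', '', my_str)

    # Option 2: decode to unicode first, then work with characters instead of bytes.
    text = my_str.decode('utf-8')
    text = re.sub(u'\xad\s*\d+\s*\xad\s*\x0c', u'', text)
    cleaned = text.encode('utf-8')  # re-encode only if bytes are needed downstream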

2 Answers


Rather than seek out specific unwanted chars, you could remove everything not wanted:

re.sub('[^\\s!-~]', '', my_str)

This throws away all characters not:

  • whitespace (spaces, tabs, newlines, etc.)
  • printable "normal" ASCII characters (! is the first printable non-space character, code 33, and ~ is the last, code 126)

You could include more chars if needed - just adjust the character class.
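
As a rough sketch of what adjusting the class might look like, here is the same whitelist run on the decoded text, so you reason about characters rather than UTF-8 bytes (the decoding step and the extra Latin-1 range are illustrative assumptions, not part of this answer):

    import re

    text = my_str.decode('utf-8')  # assumes my_str is a UTF-8-encoded byte string
    # Keep whitespace, printable ASCII and (illustratively) accented Latin-1 letters;
    # everything else, including the soft hyphen U+00AD, is dropped.
    cleaned = re.sub(u'[^\s!-~\xc0-\xff]', u'', text)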


2 Comments

This is smart. The only problem is that the so-called 'soft hyphen', '–', is used over and over again, and is part of my regex for capturing data. At the same time, it is also part of what I was hoping to remove. Sometimes, the OCR technology inserted page breaks that look like, e.g., '– 9 –\x0c'. Usually, the breaks are found in between the data I'm trying to capture. Occasionally, though, it comes right in the middle of a sentence. Thus, I AM only looking for specific instances...
Perhaps, though, I could do an initial sweep through the document and replace all instances of '–' with '--'. This would also convert the specific instances I'm now trying to remove. I could drop all instances of '\x0c' as well, and then I'd have a simple, pure 1-byte regex to deal with, sidestepping the unicode regex entirely.
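
A sketch of that two-pass idea, under the assumption that my_str is UTF-8 bytes and that a page break looks roughly like '\xc2\xad 9 \xc2\xad\x0c':

    import re

    # Pass 1: normalize every soft hyphen to plain ASCII '--' and drop the form feeds.
    normalized = my_str.replace('\xc2\xad', '--').replace('\x0c', '')

    # Pass 2: page breaks are now pure ASCII ('-- 9 --'), so a plain 1-byte regex
    # removes them; note that any other '--' pair immediately wrapping a bare
    # number would be removed too.
    cleaned = re.sub(r'--\s*\d+\s*--\s*', '', normalized)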

I had the same problem. I know this is not an efficient way, but in my case it worked:

    # Turn every backslash into the temporary marker ',x,x'.
    result = re.sub(r"\\", ",x,x", result)
    # Remove the (now marked) '\u00ad' escape sequences.
    result = re.sub(r",x,xu00ad", "", result)
    # Restore the remaining markers back to '\u'.
    result = re.sub(r",x,xu", "\\u", result)
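
For what it's worth, a small usage sketch of that trick, under the assumption (mine, not stated above) that result holds literal backslash escape sequences such as '\u00ad' as text rather than raw bytes; the ',x,x' string is just a temporary stand-in for the backslash, chosen because it is unlikely to occur in the input:

    import re

    result = r"foo\u00adbar \u00e9"            # hypothetical input with literal escapes
    result = re.sub(r"\\", ",x,x", result)     # 'foo,x,xu00adbar ,x,xu00e9'
    result = re.sub(r",x,xu00ad", "", result)  # 'foobar ,x,xu00e9'
    result = re.sub(r",x,xu", "\\u", result)   # 'foobar \u00e9'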
