
In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their UTF-8 codes. There are two troublesome characters involved: '\xc2\xad' (the UTF-8 encoding of the soft hyphen, U+00AD) and '\x0c' (a form feed). Now I just need to remove these characters, as well as some spaces and the page numbers.

Elsewhere on SO, I've seen unicode characters used in tandem with regexes, but they're written in a format my characters don't come in, e.g. '\u00ab'. In addition, none of those examples mix ASCII and non-ASCII characters. Finally, the Python docs are very light on the subject of unicode in regexes... something about flags... I don't know. Can anyone help?

Here is my current usage, which does not do what I want:

re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)
  • I guess it would be helpful to point you toward Joel and deceze. Commented Sep 25, 2013 at 9:31
  • I've read the Joel article before. So should I infer that the difficulty I'm having is just my confusion about what unicode is? Commented Sep 25, 2013 at 9:33
  • It can be. Could you describe your input more precisely (e.g. what repr(my_str) says)? Commented Sep 25, 2013 at 9:42
  • OK, it appears to be a UTF-8-encoded byte string. So your options are either 1) replace the verbatim bytes in that string, or 2) convert it to unicode and replace characters (both options are sketched just below these comments). Commented Sep 25, 2013 at 9:58
  • Look out for zero-width spaces there! Commented Sep 25, 2013 at 15:05
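
A minimal sketch of both of those options, assuming (as the comments suggest) that my_str is a UTF-8-encoded byte string and that a separator looks roughly like '\xc2\xad 9 \xc2\xad\x0c'. The exact spacing is an assumption, hence the \s* below:

    import re

    # Option 1: replace the verbatim bytes in the UTF-8 byte string.
    # '\xc2\xad' is the UTF-8 encoding of U+00AD (soft hyphen); '\x0c' is a form feed.
    cleaned = re.sub('\xc2\xad\s*\d+\s*\xc2\xad\s*\x0c', '', my_str)

    # Option 2: decode to unicode first, then work with characters instead of bytes.
    text = my_str.decode('utf-8')
    text = re.sub(u'\xad\s*\d+\s*\xad\s*\x0c', u'', text)
    cleaned = text.encode('utf-8')  # re-encode only if bytes are needed downstream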

2 Answers


Rather than seek out specific unwanted chars, you could remove everything not wanted:

re.sub('[^\\s!-~]', '', my_str)

This throws away all characters not:

  • whitespace (spaces, tabs, newlines, etc.)
  • printable "normal" ASCII characters (! is the first printable non-space character, code 33, and ~ is the last, code 126)

You could include more chars if needed - just adjust the character class.
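
As a rough sketch of what adjusting the class might look like, here is the same whitelist run on the decoded text, so you reason about characters rather than UTF-8 bytes (the decoding step and the extra Latin-1 range are illustrative assumptions, not part of this answer):

    import re

    text = my_str.decode('utf-8')  # assumes my_str is a UTF-8-encoded byte string
    # Keep whitespace, printable ASCII and (illustratively) accented Latin-1 letters;
    # everything else, including the soft hyphen U+00AD, is dropped.
    cleaned = re.sub(u'[^\s!-~\xc0-\xff]', u'', text)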


2 Comments

This is smart. The only problem is that the so-called 'soft hyphen', '–', is used over and over again, and is part of my regex for capturing data. At the same time, it is also part of what I was hoping to remove. Sometimes, the OCR technology inserted page breaks that look like, e.g., '– 9 –\x0c'. Usually, the breaks are found in between the data I'm trying to capture. Occasionally, though, it comes right in the middle of a sentence. Thus, I AM only looking for specific instances...
Perhaps, though, I could do an initial sweep through the document and replace all instances of '–' with '--'. This would also convert the specific instances I'm now trying to remove. I could drop all instances of '\x0c' as well, and then I'd have a simple, pure 1-byte regex to deal with, sidestepping the unicode regex entirely.
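
A sketch of that two-pass idea, under the assumption that my_str is UTF-8 bytes and that a page break looks roughly like '\xc2\xad 9 \xc2\xad\x0c':

    import re

    # Pass 1: normalize every soft hyphen to plain ASCII '--' and drop the form feeds.
    normalized = my_str.replace('\xc2\xad', '--').replace('\x0c', '')

    # Pass 2: page breaks are now pure ASCII ('-- 9 --'), so a plain 1-byte regex
    # removes them; note that any other '--' pair immediately wrapping a bare
    # number would be removed too.
    cleaned = re.sub(r'--\s*\d+\s*--\s*', '', normalized)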

I had the same problem. I know this is not an efficient way, but in my case it worked:

    # Turn every backslash into the temporary marker ',x,x'.
    result = re.sub(r"\\", ",x,x", result)
    # Remove the (now marked) '\u00ad' escape sequences.
    result = re.sub(r",x,xu00ad", "", result)
    # Restore the remaining markers back to '\u'.
    result = re.sub(r",x,xu", "\\u", result)
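
For what it's worth, a small usage sketch of that trick, under the assumption (mine, not stated above) that result holds literal backslash escape sequences such as '\u00ad' as text rather than raw bytes; the ',x,x' string is just a temporary stand-in for the backslash, chosen because it is unlikely to occur in the input:

    import re

    result = r"foo\u00adbar \u00e9"            # hypothetical input with literal escapes
    result = re.sub(r"\\", ",x,x", result)     # 'foo,x,xu00adbar ,x,xu00e9'
    result = re.sub(r",x,xu00ad", "", result)  # 'foobar ,x,xu00e9'
    result = re.sub(r",x,xu", "\\u", result)   # 'foobar \u00e9'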
