1

I'm trying to iterate through an XML file (UTF-8 encoded, starts with ) with lxml, but get the following error on the character 丂 :

UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence

Other characters before this are printed out correctly. The code is:

parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse("filename.xml", parser)
root = tree.getroot()
for elem in root:
    print elem[0].text

Does the error mean that it didn't parse the file in utf-8 but in shift JIS instead?

2 Answers 2

2

The stacktrace of the UnicodeEncodeError points to the location where the exception occurs. Unfortunately you didn’t include it but it’s most likely the last line where the unicode text is printed to stdout. I assume that stdout uses cp932 encoding on your system.

If my assumptions are correct you should consider changing your environment such that stdout uses an encoding that can represent unicode characters (like UTF-8). (see for example Writing unicode strings via sys.stdout in Python).

Sign up to request clarification or add additional context in comments.

1 Comment

Oh so it was just stdouts encoding, I didn't realize that! I was using it just for testing, so I didn't have a problem after all :D Thank you!
2

I had a similar situation using lxml's objectify. Here's how I was able to fix it.

import unicodedata
my_name = root.name.text
if isinstance(my_name, unicode):
    # Decode to string.
    my_name = unicodedata.normalize('NFKD', my_name).encode('ascii','ignore')

1 Comment

Worked perfectly for r = requests.get(...) that would not work in objectify.XML(r.text)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.