lxml encoding error when parsing utf8 xml

Question

I'm trying to iterate through an XML file (UTF-8 encoded, starts with an XML declaration) with lxml, but I get the following error on the character 丂:

UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence

Other characters before this are printed out correctly. The code is:

from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse("filename.xml", parser)
root = tree.getroot()
for elem in root:
    print elem[0].text

Does the error mean that the file was parsed as Shift JIS instead of UTF-8?

Community · Accepted Answer · 2017-05-23 12:22:43Z

The traceback of the UnicodeEncodeError points to the location where the exception occurs. You didn't include it, but it is most likely the last line, where the unicode text is printed to stdout. I assume that stdout uses the cp932 encoding on your system. In that case the file was parsed correctly, and the failure happens only when the text is encoded for output.
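
To check this assumption without lxml, one can try encoding the character from the question directly. The following is a minimal sketch, not part of the original answer; the helper name `can_encode` is made up for illustration, and the code is written so that it should run on both Python 2 and Python 3.

```python
# Hypothetical helper (not from the original answer): report whether a
# unicode string can be represented in the given encoding.
def can_encode(text, encoding):
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

char = u"\u4e02"  # the character from the question

# cp932 has no mapping for this character, UTF-8 does.
cp932_ok = can_encode(char, "cp932")
utf8_ok = can_encode(char, "utf-8")
```

If `cp932_ok` is false while `utf8_ok` is true, the problem lies in the output encoding, not in how the XML was parsed.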

If my assumptions are correct, you should consider changing your environment so that stdout uses an encoding that can represent all Unicode characters, such as UTF-8 (see, for example, "Writing unicode strings via sys.stdout in Python").
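
One way to do this from inside a script is to wrap the output stream in a UTF-8 text writer. The sketch below is an illustration under assumptions, not the original answer's code: it writes to an in-memory `io.BytesIO` buffer as a stand-in for the real stdout so that the result can be inspected, and the name `utf8_writer` is hypothetical.

```python
import io

# Hypothetical helper: wrap a binary stream so that unicode text written
# to it is encoded as UTF-8 instead of the platform default (e.g. cp932).
def utf8_writer(binary_stream):
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")

buf = io.BytesIO()           # stand-in for the real stdout
out = utf8_writer(buf)
out.write(u"\u4e02\n")       # the character from the question
out.flush()
encoded = buf.getvalue()     # the bytes that would reach the terminal
```

On a real console, the same effect can usually be had by setting the `PYTHONIOENCODING=utf-8` environment variable before starting Python, or on Python 3.7 and later by calling `sys.stdout.reconfigure(encoding="utf-8")`. Whether the terminal then displays the character correctly depends on the terminal itself.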

edited May 23, 2017 at 12:22

answered Dec 7, 2012 at 15:28

Benedikt Waldvogel

1 Comment

blub Over a year ago

Oh so it was just stdouts encoding, I didn't realize that! I was using it just for testing, so I didn't have a problem after all :D Thank you!

paragbaxi · Accepted Answer · 2013-10-08 21:33:15Z

I had a similar situation using lxml's objectify. Here's how I was able to fix it.

import unicodedata
my_name = root.name.text
if isinstance(my_name, unicode):
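    # Encode to an ASCII byte string. NFKD splits accented letters into base letter plus
    # combining mark; 'ignore' then drops every non-ASCII character (lossy: u'\u4e02' vanishes).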
    my_name = unicodedata.normalize('NFKD', my_name).encode('ascii','ignore')
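
Note that this conversion is lossy. As a rough illustration (a sketch with a made-up helper name, `to_ascii`, written to run on Python 2 and 3), accented Latin letters keep their base letter, while characters without an ASCII decomposition, such as the one from the question, are dropped:

```python
import unicodedata

# Hypothetical helper wrapping the normalize-then-encode step shown above.
def to_ascii(text):
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore")

accented = to_ascii(u"caf\u00e9")  # accented e decomposes to "e" plus a combining accent
cjk = to_ascii(u"\u4e02")          # no decomposition, so nothing survives
```

Here `accented` comes out as the bytes `cafe` and `cjk` as an empty byte string, so this approach avoids the error but does not preserve text such as 丂.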

answered Oct 8, 2013 at 21:33

paragbaxi

1 Comment

Juha Untinen Over a year ago

Worked perfectly for r = requests.get(...) that would not work in objectify.XML(r.text)
