I'm having trouble handling HTML that contains escaped Unicode characters (in the Chinese range) with Python 3 and BeautifulSoup on Windows. BeautifulSoup seems to work correctly until I try to print an extracted tag or write it out to a file. I have my default encoding set to utf-8, yet a cp1252 codec seems to be getting selected...

To reproduce:

from bs4 import BeautifulSoup

soup = BeautifulSoup("隱")

f = open("out.html", "w")
f.write(soup.text)
f.close()

Stack trace attached.

Traceback (most recent call last):
  File "scrape.py", line 143, in <module>
    test_uni()
  File "scrape.py", line 126, in test_uni
    f.write(soup.text)
  File "c:\venv\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u96b1' in position 0: character maps to <undefined>

1 Answer

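You are writing a string with non-ASCII characters to a file opened in text mode without an explicit encoding. In Python 3, open() then falls back to the locale's preferred encoding, which on your Windows setup is cp1252 (hence cp1252.py in the traceback), and cp1252 has no mapping for Chinese characters. The "default encoding" reported by sys.getdefaultencoding() is not what open() consults here.

Passing the encoding when you open the file should work, and utf-8 is fine for Chinese characters:

f = open("out.html", "w", encoding="utf-8")
f.write(soup.text)
f.close()

Note that calling soup.text.encode('utf-8') and writing the result only works if the file is opened in binary mode ("wb"); in text mode, write() rejects bytes with a TypeError.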
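
For illustration, here is a minimal sketch of writing the character from the traceback with an explicit encoding. It uses a plain string as a stand-in for soup.text so it runs without BeautifulSoup; the helper name write_utf8 and the temporary output path are assumptions made for the example, not part of the original script.

```python
import locale
import os
import sys
import tempfile

# Stand-in for soup.text; "\u96b1" is the character from the traceback.
text = "\u96b1"

# These two values can differ. open() in text mode normally uses the
# locale's preferred encoding, not sys.getdefaultencoding().
print("default encoding:", sys.getdefaultencoding())
print("locale preferred encoding:", locale.getpreferredencoding(False))


def write_utf8(path, content):
    """Hypothetical helper: write text using an explicit UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)


# Write to a temporary directory (an assumption, to avoid touching real files).
out_path = os.path.join(tempfile.mkdtemp(), "out.html")
write_utf8(out_path, text)

# Read the raw bytes back to confirm the file holds UTF-8 data.
with open(out_path, "rb") as f:
    raw = f.read()
print(raw)
```

On a machine where the second printed value is cp1252, omitting the encoding argument would be expected to reproduce the UnicodeEncodeError from the question.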