
I'm trying to understand Unicode and all the associated things. I have made a utf-8.txt file, which is obviously encoded in UTF-8. It contains "Hello world!". Here's what I do:

f = open('utf8.txt', mode = 'r', encoding = 'utf8')
f.read()

What I get is: '\ufeffHello world!'. Where did the prefix come from?

Another try:

f = open('utf8.txt', 'rb')
byte = f.read()

Printing byte gives: b'\xef\xbb\xbfHello world!'. I assume the prefix shows up here as hex bytes.

byte.decode('utf8')

The above code again gives me: '\ufeffHello world!'

What am I doing wrong? How do I read the text from a UTF-8 file into Python?

Thanks for feedback!

2 Comments
  • Whatever editor you used to save the file added a UTF-8 BOM at the beginning of the file, which is explicitly discouraged. Get a better editor. Commented Mar 15, 2016 at 20:15
  • Bear in mind, "Hello world!" is UTF-8, ASCII, ISO-8859-1, ISO-8859-15, Windows-1252, etc. Things only get interesting after 0x7F. Commented Mar 15, 2016 at 22:15
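
The second comment's point can be checked directly. The following is a small sketch, not part of the original thread; the variable names are illustrative, and 'cp1252' is Python's codec name for Windows-1252.

```python
# Codec names as Python spells them; all five are ASCII-compatible.
codec_names = ['utf-8', 'ascii', 'iso-8859-1', 'iso-8859-15', 'cp1252']

# Pure ASCII text encodes to the same bytes under every codec listed.
ascii_results = {name: 'Hello world!'.encode(name) for name in codec_names}
print(ascii_results)

# A character above 0x7F is where the encodings start to disagree.
e_utf8 = 'é'.encode('utf-8')         # two bytes in UTF-8
e_latin1 = 'é'.encode('iso-8859-1')  # one byte in ISO-8859-1
print(e_utf8, e_latin1)
```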

1 Answer


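Your utf-8.txt is encoded as UTF-8 with a BOM (byte order mark), sometimes labelled utf-8-bom, which is different from plain utf-8. In such a file, the character '\uFEFF' is written at the beginning as the three bytes EF BB BF. Python's 'utf8' codec decodes those bytes but does not strip the resulting character. Instead of using encoding = 'utf8', try encoding = 'utf-8-sig':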

# 'utf-8-sig' strips a leading BOM if one is present
with open('utf8.txt', mode='r', encoding='utf-8-sig') as f:
    print(f.read())
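
To see the difference without touching a file, here is a minimal sketch that decodes the exact bytes from the question with both codecs. The variable names are illustrative only.

```python
import codecs

# The exact bytes shown in the question: a UTF-8 BOM followed by ASCII text.
raw = b'\xef\xbb\xbfHello world!'

# The prefix is the UTF-8 encoding of U+FEFF, exposed as codecs.BOM_UTF8.
has_bom = raw.startswith(codecs.BOM_UTF8)

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode('utf-8')

# 'utf-8-sig' drops a leading BOM if present.
without_bom = raw.decode('utf-8-sig')

# It is also safe to use on input that has no BOM at all.
no_bom_input = b'Hello world!'.decode('utf-8-sig')

print(repr(with_bom))
print(repr(without_bom))
```

Since 'utf-8-sig' accepts input with or without a BOM, it is a reasonable choice for reading when you do not know which kind of file you will get.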

1 Comment

To explain, utf-8-sig is a special codec, which automatically removes the BOM on read and adds it on write.
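
The read and write behaviour described in this comment can be sketched with a round trip through a scratch file. The temporary path is a hypothetical one created just for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Writing with 'utf-8-sig' prepends the three BOM bytes.
with open(path, mode='w', encoding='utf-8-sig') as f:
    f.write('Hello world!')

# Inspect the raw bytes that ended up on disk.
with open(path, 'rb') as f:
    raw = f.read()

# Reading with the same codec strips the BOM again.
with open(path, mode='r', encoding='utf-8-sig') as f:
    text = f.read()

print(raw)
print(repr(text))
```

If the file is meant for tools that dislike a BOM, writing with plain 'utf-8' and reading with 'utf-8-sig' avoids producing one while still tolerating it on input.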
