binary -> UTF-8 -> string

Question

I'm trying to understand Unicode and everything associated with it. I have made a utf-8.txt file, which is obviously encoded in UTF-8. It contains "Hello world!". Here's what I do:

f = open('utf8.txt', mode = 'r', encoding = 'utf8')
f.read()

What I get is: '\ufeffHello world!'. Where did the prefix come from?

Another try:

f = open('utf8.txt', 'rb')
byte = f.read()

Printing byte gives: b'\xef\xbb\xbfHello world!'. I assume the prefix shows up here as hex.

byte.decode('utf8')

The code above again gives me: '\ufeffHello world!'

What am I doing wrong? How do I read the text from a UTF-8 file into Python?

Thanks for feedback!

Whatever editor you used to save the file added a UTF-8 BOM at the beginning of the file, which is explicitly discouraged. Get a better editor.
– Matteo Italia, Commented Mar 15, 2016 at 20:15
Bear in mind, "Hello world!" is UTF-8, ASCII, ISO-8859-1, ISO-8859-15, Windows-1252, etc. Things only get interesting above 0x7F.
– Alastair McCormack, Commented Mar 15, 2016 at 22:15
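
To illustrate the point about 0x7F, here is a minimal sketch. The sample character "é" and the variable names are my own choices for illustration, not from the original post. It checks that pure-ASCII text produces identical bytes in all the listed encodings, while a character above 0x7F does not.

```python
encodings = ["utf-8", "ascii", "iso-8859-1", "iso-8859-15", "cp1252"]

# Pure-ASCII text: every listed encoding produces the same bytes.
ascii_text = "Hello world!"
ascii_results = {enc: ascii_text.encode(enc) for enc in encodings}
all_same = len(set(ascii_results.values())) == 1

# A character above 0x7F (U+00E9, "é") encodes differently.
e_acute = "\u00e9"
e_utf8 = e_acute.encode("utf-8")         # two bytes in UTF-8
e_latin1 = e_acute.encode("iso-8859-1")  # one byte in ISO-8859-1

print(all_same)
print(e_utf8, e_latin1)
```

Encoding the same character with 'ascii' would raise UnicodeEncodeError, since ASCII has no code for it.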

Yunhe · Accepted Answer · 2016-03-15 20:43:49Z

7

Your utf-8.txt is encoded as UTF-8 with a BOM (sometimes labeled utf-8-bom), which is different from plain UTF-8. In such a file, the character '\uFEFF' is written at the beginning of the file as the three bytes EF BB BF. Instead of using encoding = 'utf8', try encoding = 'utf-8-sig':

with open('utf8.txt', mode='r', encoding='utf-8-sig') as f:
    print(f.read())
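
For a self-contained way to see the difference, the sketch below creates its own test file in a temporary directory (the path and variable names are hypothetical, chosen for the demo) with the same bytes shown in the question, then reads it back with both codecs.

```python
import os
import tempfile

# Hypothetical demo file: "Hello world!" preceded by the UTF-8 BOM bytes.
path = os.path.join(tempfile.mkdtemp(), "utf8.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfHello world!")

# Plain 'utf-8' decodes the BOM bytes into the character U+FEFF and keeps it.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' drops a leading BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom))     # '\ufeffHello world!'
print(repr(without_bom))  # 'Hello world!'
```

Reading with 'utf-8-sig' also works on files that have no BOM, so it is a reasonable choice when the origin of a file is unknown.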


Alastair McCormack Over a year ago

To explain, utf-8-sig is a special codec, which automatically removes the BOM on read and adds it on write.
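
The read and write behaviour described in this comment can be checked in memory, without touching a file. This is a small sketch with illustrative variable names; codecs.BOM_UTF8 is the standard-library constant for the three BOM bytes.

```python
import codecs

text = "Hello world!"

# Encoding with 'utf-8-sig' prepends the BOM; plain 'utf-8' does not.
encoded_sig = text.encode("utf-8-sig")
encoded_plain = text.encode("utf-8")

# Decoding with 'utf-8-sig' removes a leading BOM,
# and leaves BOM-less input unchanged.
decoded_from_sig = encoded_sig.decode("utf-8-sig")
decoded_from_plain = encoded_plain.decode("utf-8-sig")

print(encoded_sig)
print(encoded_sig == codecs.BOM_UTF8 + encoded_plain)
```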
