python encoding error

Question

What does one do with this kind of error? You are reading lines from a file. You don't know the encoding.

What does "byte 0xed" mean? What does "position 3792" mean?

I'll try to answer this myself and repost but I'm slightly annoyed that I'm spending as long as I am figuring this out. Is there a clobber/ignore and continue method for getting past unknown encodings? I just want to read a text file!

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    for x in fin:
  File "/bns/rma/local/lib/python3.1/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 3792: ordinal not in range(128)

To read a text file you need its encoding. The default ASCII encoding often works, but not here.
– Jochen Ritzel, Commented Aug 3, 2011 at 18:12

GaretJax · Accepted Answer · 2011-08-03 18:08:32Z

3

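0xed is the value of the byte that could not be decoded; ASCII only covers bytes 0x00 to 0x7f. Position 3792 is the zero-based offset of that byte within the data passed to the decoder, which may be a buffered chunk rather than the whole file. It is a byte offset, not a count of letters. If the file is Latin-1 encoded, 0xed is í (U+00ED); in UTF-8, 0xed would instead be the first byte of a multi-byte sequence.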
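A minimal sketch for looking at the failing byte yourself. The sample bytes are made up for illustration (they are not from the file in the question); the exception attribute `start` holds the offset that the error message calls "position".

```python
# Hypothetical sample data: the byte 0xed sits at index 3.
data = b"caf\xed"

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the zero-based byte offset reported as "position".
    position = exc.start
    bad_byte = data[exc.start]

# The same bytes decode without error if you assume Latin-1.
as_latin1 = data.decode("latin-1")

print(position, hex(bad_byte), as_latin1)
```

Whether Latin-1 is the right assumption for a given file is a separate question; this only shows what the two numbers in the message refer to.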

You are using the ascii codec to decode the file, but the file is not ASCII-encoded. Try a Unicode-aware codec instead (utf_8, perhaps), or, if you know the encoding used to write the file, choose the appropriate one from the full list of available codecs.
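As a rough sketch of what "try a Unicode-aware codec" looks like in Python 3: pass the codec name to `open()` explicitly. The temporary file and its contents are made up so the example runs on its own, and UTF-8 is an assumption here, not something known about the file in the question.

```python
import os
import tempfile

# Hypothetical sample file, written as UTF-8 so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("caf\u00e9 con az\u00facar\n".encode("utf-8"))
    path = tmp.name

# Name the codec explicitly instead of relying on the platform default.
with open(path, encoding="utf-8") as fin:
    lines = [line.rstrip("\n") for line in fin]

os.remove(path)
print(lines)
```

If the file is not actually UTF-8, this raises UnicodeDecodeError again, which is at least a clear signal that the guess was wrong.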

4 Comments

safetyduck Over a year ago

Thanks! That answers some of the questions ... but how does one actually pick the encoding? It's non-trivial. I just want to be dumb and push play and read the file as if I was in a text editor and deal with whatever garbage I can see ... Do I just use open(file, 'rb') and deal with the mess? ... but then I don't have strings. I don't quite see what the quick fix is.

safetyduck Over a year ago

I see that there is a chardet module somewhere which has some autodetect features but it appears to be non-standard.

GaretJax Over a year ago

There is no absolute fix if you don't know the encoding. You can try to auto-detect it, but if you don't have enough samples, detection will fail at some point.

tchrist Over a year ago

Never guess the encoding. That trick never works. Make something tell you what the encoding is. Outlaw *.txt files. If you cannot distinguish one 8-bit encoding from another, say MacRoman vs Latin1, then chardet is useless. Which in this case, it is.
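A small illustration of why guessing between 8-bit encodings is hard, assuming the two codecs mentioned in the comment (Python's `latin-1` and `mac_roman`): the same byte decodes successfully under both, just to different characters, so the decoder cannot tell you which one was intended.

```python
# The byte from the error message in the question.
raw = b"\xed"

# Both single-byte codecs accept it without complaint.
as_latin1 = raw.decode("latin-1")
as_mac_roman = raw.decode("mac_roman")

# Neither call fails, yet the results differ.
print(repr(as_latin1), repr(as_mac_roman), as_latin1 == as_mac_roman)
```

Only information from outside the bytes (file metadata, the producer of the file, or a human reading the result) can settle which decoding is correct.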

safetyduck · Accepted Answer · 2011-08-03 18:28:54Z

0

I think I found the way to be dumb :) :

# fin must yield bytes, e.g. a file opened with mode 'rb'
fin = (x.decode('ascii', 'ignore') for x in fin)

for x in fin:
    print(x)

where errors='ignore' could be 'replace' or whatever. This at least follows the idiom "garbage in, garbage out" that I am seeking.
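To make the difference between the two error handlers concrete, here is a self-contained sketch. An `io.BytesIO` object stands in for a file opened with `'rb'`, and the sample bytes are invented for the example.

```python
import io

# Stand-in for open(filename, 'rb'); iterating yields bytes lines.
fin = io.BytesIO(b"caf\xed\nplain\n")
raw_lines = list(fin)

# 'ignore' silently drops undecodable bytes.
ignored = [x.decode("ascii", "ignore") for x in raw_lines]

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = [x.decode("ascii", "replace") for x in raw_lines]

print(ignored)
print(replaced)
```

'replace' is often the safer of the two for eyeballing a file, since you can see where data was lost.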

1 Comment

safetyduck Over a year ago

And I just noticed that the codecs module has an optional 'errors' argument that can be set to 'ignore', e.g.: fin = codecs.open(filename, encoding='ascii', mode='r', errors='ignore')
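In Python 3 the built-in `open()` takes the same `errors` argument, so `codecs.open` is not strictly needed. A minimal sketch, using a made-up temporary file so that it runs on its own:

```python
import os
import tempfile

# Hypothetical sample file containing one non-ASCII byte (0xed).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"caf\xed\nplain\n")
    sample_path = tmp.name

# errors="replace" substitutes U+FFFD for each undecodable byte.
with open(sample_path, encoding="ascii", errors="replace") as fin:
    text_lines = fin.read().splitlines()

os.remove(sample_path)
print(text_lines)
```

Swapping `errors="replace"` for `errors="ignore"` gives the drop-the-byte behaviour described above.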
