What does one do with this kind of error? You are reading lines from a file. You don't know the encoding.

What does "byte 0xed" mean? What does "position 3792" mean?

I'll try to answer this myself and repost but I'm slightly annoyed that I'm spending as long as I am figuring this out. Is there a clobber/ignore and continue method for getting past unknown encodings? I just want to read a text file!

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    for x in fin:
  File "/bns/rma/local/lib/python3.1/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 3792: ordinal not in range(128)
  • To read a text file you need its encoding. The default ASCII encoding often works, but not here. Commented Aug 3, 2011 at 18:12

2 Answers

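0xed is the value of the byte that the ascii codec could not decode: ASCII only covers bytes 0x00 to 0x7f, and 0xed (237) is outside that range. In Latin-1 that byte is í (U+00ED); in UTF-8 it would be the first byte of a multi-byte sequence. Position 3792 is the zero-based offset of that byte within the data handed to the decoder, so it counts bytes, not letters.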
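To see what the two numbers in the message refer to, you can catch the exception and look at its attributes. This is only a sketch: the sample bytes are made up for illustration and are not the asker's file.

```python
# Hypothetical sample: the text "Mar\u00eda" stored as Latin-1 bytes.
data = b"Mar\xeda"

try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    # err.object is the bytes being decoded, err.start the offset of the bad byte.
    position = err.start
    bad_byte = err.object[position]

print(hex(bad_byte), position)  # prints: 0xed 3
```

The offset is counted from zero and in bytes, which is why it need not match a character count in a text editor.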

You are using the ascii codec to decode the file, but the file is not ASCII-encoded. Try a Unicode-aware codec instead (utf_8, maybe?) or, if you know the encoding used to write the file, choose the matching one from the full list of available codecs.
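One way to try a codec explicitly is to pass encoding= to open(). The sketch below tries UTF-8 first and falls back to Latin-1; the helper name read_lines and the candidate list are assumptions made for this example. Note that Latin-1 accepts every byte, so the fallback never fails but may still show the wrong characters.

```python
import os
import tempfile

def read_lines(path, encodings=("utf-8", "latin-1")):
    """Return (lines, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as fin:
                return fin.read().splitlines(), enc
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings fit")

# Demo on a throwaway file that holds Latin-1 bytes (made-up data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Mar\xeda\n")
lines, used = read_lines(tmp.name)
os.remove(tmp.name)

print(used, ascii(lines))
```

Here UTF-8 is rejected because 0xed must be followed by continuation bytes, so the helper ends up using Latin-1.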


4 Comments

Thanks! That answers some of the questions ... but how does one actually pick the encoding? It's non-trivial. I just want to be dumb and push play and read the file as if I were in a text editor and deal with whatever garbage I can see ... Do I just use open(file, 'rb') and deal with the mess? ... but then I don't have strings. I don't quite see what the quick fix is.
I see that there is a chardet module somewhere which has some autodetect features but it appears to be non-standard.
There is no absolute fix if you don't know the encoding. You can try to auto-detect it, but if you don't have enough sample data, detection will fail at some point.
Never guess the encoding. That trick never works. Make something tell you what the encoding is. Outlaw *.txt files. If you cannot distinguish one 8-bit encoding from another, say MacRoman vs Latin1, then chardet is useless. Which in this case, it is.
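The point about 8-bit encodings can be illustrated with a few lines. In this sketch (made-up sample bytes again), the same data decodes without any error under three single-byte codecs, so the absence of an exception tells you nothing about which one is right.

```python
raw = b"Mar\xeda"  # hypothetical sample bytes

# Every one of these decodes succeeds; only the resulting character differs.
decoded = {enc: raw.decode(enc) for enc in ("latin-1", "cp1252", "mac_roman")}

for enc, text in decoded.items():
    print(enc, ascii(text))
```

Latin-1 and cp1252 agree on this byte, while mac_roman maps it to a different letter, and nothing in the bytes themselves says which reading was intended.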
I think I found the way to be dumb :) :

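fin = open(filename, 'rb')  # binary mode, so each line is a bytes object
fin = (x.decode('ascii', 'ignore') for x in fin)

for x in fin:
    print(x, end='')  # each line already ends with its own newline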

where the second argument, errors='ignore', could be 'replace' or another handler. This at least follows the "garbage in, garbage out" idiom that I am after.
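For comparison, here is roughly what the different handlers do to the same made-up bytes. 'backslashreplace' for decoding needs Python 3.5 or later, so it would not have been available on the 3.1 install from the question.

```python
raw = b"Mar\xeda"  # hypothetical sample bytes

ignored = raw.decode("ascii", "ignore")             # bad byte silently dropped
replaced = raw.decode("ascii", "replace")           # bad byte becomes U+FFFD
escaped = raw.decode("ascii", "backslashreplace")   # bad byte kept as a \xNN escape

for name, text in (("ignore", ignored), ("replace", replaced), ("backslashreplace", escaped)):
    print(name, ascii(text))
```

'ignore' loses the fact that something was there at all, while 'replace' and 'backslashreplace' leave a visible marker, which tends to be easier to debug later.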

1 Comment

And I just noticed that the codecs module has an optional 'errors' argument that can be set to 'ignore', e.g.: fin = codecs.open(filename, encoding='ascii', mode='r', errors='ignore')
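In Python 3 the built-in open() takes the same errors argument, so the codecs module is not strictly needed. A minimal sketch, using a throwaway file with made-up contents:

```python
import os
import tempfile

# Write a small test file containing one non-ASCII byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Mar\xeda\nplain\n")
path = tmp.name

# errors='replace' turns each undecodable byte into U+FFFD instead of raising.
with open(path, encoding="ascii", errors="replace") as fin:
    lines = [line.rstrip("\n") for line in fin]

os.remove(path)
print(ascii(lines))
```

This reads the file as text, line by line, and never raises UnicodeDecodeError, at the cost of losing the original byte.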
