Unicode Decode Error in Python

Question

TDB = csv.reader(codecs.open('data/TDS.csv', 'rb', encoding='utf-8'), delimiter=',', quotechar='"')

ts = db.testCol

for row in TDB:
    print row[1]
    T = {"t":row[1],
             "s": row[0]}
    post_id = ts.insert(T)

I'm not sure why I can't encode this as UTF-8. I need the data in UTF-8 format to insert it into the database.

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 36: invalid continuation byte

Before I added the encoding argument, I got this error from pymongo:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8

I guess this is the data it couldn't handle:

'compleja e intelectualmente retadora , el ladrÛn de orquÌdeas es uno de esos filmes que vale la pena ver precisamente por su originalidad . '

Does anyone know what I should do? Thanks.

You're trying to read in (decode) the data as UTF-8, not encode it. Make sure your file, "TDS.csv", is encoded as UTF-8.
– monkut, Commented Feb 21, 2013 at 6:44
@monkut, may I know what I should do if I want to output the rows as UTF-8 and save them with pymongo? Thanks.
– 1myb, Commented Feb 21, 2013 at 6:50
You first need to know what encoding the data in "TDS.csv" uses. Also note that the csv module doesn't support unicode (which is what codecs.open() returns).
– monkut, Commented Feb 21, 2013 at 7:09
If your file is already in UTF-8, you should be able to use the standard open() (not codecs.open()) and not worry about the conversion.
– monkut, Commented Feb 21, 2013 at 7:10
@monkut, it's not UTF-8. The data is tweets extracted from a live stream, and I'm not sure of its encoding. Without the codecs call, pymongo raises an error saying it only accepts UTF-8 input. Thanks for the reply.
– 1myb, Commented Feb 21, 2013 at 10:09

monkut · Accepted Answer · 2013-02-21 22:52:09Z

OK, this might help.

There is a list of encodings here:

http://docs.python.org/2/library/codecs.html#standard-encodings

latin-1 is a common encoding for Western European languages.

The basic flow for dealing with encodings is:

1. Read in the encoded content.
2. Decode it to unicode: content.decode("source encoding").
3. Encode the unicode to the desired encoding: unicode_content.encode("desired encoding").
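The three steps above can be sketched on a small byte string before touching the real file. This is only an illustration: the sample bytes are made up to resemble the Spanish text in the question (assuming it was saved as latin-1), and the variable names are hypothetical. It is written so that it should run on both Python 2.7 and Python 3.

```python
# Step 1: raw encoded content. Here, sample bytes assumed to be latin-1,
# where 0xf3 is "o" with an acute accent and 0xed is "i" with an acute accent.
raw = b"el ladr\xf3n de orqu\xeddeas"

# Step 2 with the wrong source encoding fails, as in the question.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # exc.start is the offset of the offending byte

# Step 2 with a matching source encoding yields unicode text.
text = raw.decode("latin-1")

# Step 3: encode the unicode text to the desired encoding.
utf8_bytes = text.encode("utf-8")
```

In UTF-8, the byte 0xf3 can only start a multi-byte sequence, so an ordinary ASCII letter right after it triggers the "invalid continuation byte" error.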

You can try going through encodings that seem plausible and see which ones don't cause an error:

enc = "latin-1"  # candidate source encoding
with open("TDS.csv", "rb") as f:
    content = f.read()  # raw encoded bytes
u_content = content.decode(enc)  # decode from enc to unicode
utf8_content = u_content.encode("utf8")  # re-encode as UTF-8
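Trying several encodings in turn could look roughly like the sketch below. The helper name find_working_encodings, the candidate list, and the sample bytes are all assumptions for illustration; in practice the bytes would come from reading the CSV file in binary mode.

```python
def find_working_encodings(raw, candidates):
    """Return the candidate encodings that decode raw without an error."""
    working = []
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent these bytes
        working.append(enc)
    return working


# Hypothetical sample standing in for the file content.
sample = b"el ladr\xf3n de orqu\xeddeas"
candidates = ["utf-8", "latin-1", "cp1252", "mac_roman"]
working = find_working_encodings(sample, candidates)
```

Decoding without an error does not prove the encoding is the right one: latin-1 accepts every possible byte, so it never fails. It is worth printing the decoded text for each surviving candidate and checking which one reads correctly.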


1myb Over a year ago

Thanks. Yesterday I tried this, and it helped too: row[1].decode('latin-1').encode('ascii', 'xmlcharrefreplace')
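For reference, here is a small sketch of what that expression does to non-ASCII characters. The sample bytes are an assumption standing in for row[1], again taken to be latin-1.

```python
# Hypothetical stand-in for row[1]: latin-1 encoded bytes.
raw = b"el ladr\xf3n de orqu\xeddeas"

# Decode to unicode, then encode to ASCII, replacing each character that
# ASCII cannot represent with an XML numeric character reference.
ascii_bytes = raw.decode("latin-1").encode("ascii", "xmlcharrefreplace")

# For comparison: encoding to UTF-8 keeps the characters themselves.
utf8_bytes = raw.decode("latin-1").encode("utf-8")
```

The ASCII result is valid UTF-8, so pymongo should accept it, but the stored string then contains references such as &#243; instead of the accented letters. Encoding to UTF-8 keeps the original characters, which is probably preferable unless the text is destined for HTML or XML.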
