TDB = csv.reader(codecs.open('data/TDS.csv', 'rb', encoding='utf-8'), delimiter=',', quotechar='"')

ts = db.testCol

for row in TDB:
    print row[1]
    T = {"t":row[1],
             "s": row[0]}
    post_id = ts.insert(T)

I'm not sure why I can't encode it as UTF-8. I need the data to be in UTF-8 format before I can insert it into the database.

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 36: invalid continuation byte

Before I added the encoding argument, I got this from pymongo:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8

I guess this is the data it couldn't encode:

'compleja e intelectualmente retadora , el ladrÛn de orquÌdeas es uno de esos filmes que vale la pena ver precisamente por su originalidad . '

Does anyone know what I should do? Thanks.

  • You're trying to read in (decode) the data as UTF-8, not encode it. Make sure your file, "TDS.csv", is encoded as UTF-8. Commented Feb 21, 2013 at 6:44
  • @monkut, what should I do if I want to output the data as UTF-8 and save it with pymongo? Thanks. Commented Feb 21, 2013 at 6:50
  • You first need to know which encoding "TDS.csv" uses. Also note that the csv module doesn't support unicode (which is what codecs.open() returns). Commented Feb 21, 2013 at 7:09
  • If your file is already in UTF-8, you should be able to use the standard open() (not codecs.open()) and not worry about the conversion. Commented Feb 21, 2013 at 7:10
  • @monkut, it's not UTF-8: these are tweets extracted from a live stream, and I'm not sure what their encoding is. Without the codecs call, pymongo raises an error saying it only accepts UTF-8 input. Thanks for the reply. Commented Feb 21, 2013 at 10:09

1 Answer

OK, this might help.

There is a list of encodings here:

http://docs.python.org/2/library/codecs.html#standard-encodings

latin-1 is a common encoding for Western European languages.

The basic flow for dealing with encodings is:

  1. Read in the encoded content.
  2. Decode it from the source encoding to unicode: content.decode("source encoding")
  3. Encode the unicode to the desired encoding: unicode_content.encode("desired encoding")

You can try going through encodings that seem right and see which ones don't cause an error:

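enc = "latin-1"  # candidate source encoding; change it and retry if decoding fails
with open("data/TDS.csv", "rb") as f:  # binary mode: read the raw bytes
    content = f.read()  # raw encoded content
u_content = content.decode(enc)  # decodes from enc to unicode
utf8_content = u_content.encode("utf8")  # re-encodes the unicode as UTF-8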
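
Trying several candidate encodings in turn could be sketched as a small loop. This is only an illustration: the helper name `first_working_encoding` and the candidate list are assumptions, not something from the question. Keep in mind that latin-1 maps every possible byte to a character, so it never raises an error; a successful decode only tells you the bytes are valid in that encoding, not that the text is correct, so check the result by eye.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes raw.

    Hypothetical helper for illustration; the candidate order is an assumption.
    """
    for enc in candidates:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding; try the next
    raise ValueError("none of the candidate encodings could decode the data")


# 0xf3 is the byte named in the traceback from the question.
sample = b"el ladr\xf3n de orqu\xeddeas"
enc, text = first_working_encoding(sample)
utf8_bytes = text.encode("utf-8")  # the same text, re-encoded as valid UTF-8
```

Putting utf-8 first means a file that is already valid UTF-8 is left alone, and the catch-all latin-1 goes last because it accepts anything.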

1 Comment

Thanks. Yesterday I tried this, and it helped too: row[1].decode('latin-1').encode('ascii','xmlcharrefreplace')
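
For reference, here is a small sketch of what that `xmlcharrefreplace` call does. The sample bytes are an assumption based on the traceback in the question. Characters outside ASCII are replaced with numeric XML character references, so the result is plain ASCII (and therefore also valid UTF-8), but the stored text will contain the reference instead of the accented character itself.

```python
raw = b"ladr\xf3n"  # assumed latin-1 bytes; 0xf3 is the accented "o"
ascii_bytes = raw.decode("latin-1").encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)  # the accented character becomes the reference &#243;
```

If you want the real characters in the database, encode to "utf8" instead, as in the answer.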
