I'm writing a web crawler and need to save the HTML of each page I crawl to my MongoDB database. This is what I'm trying to do (I'm using pymongo):

        c=urllib2.urlopen(myUrl)
        html=c.read()
        db.urls.insert(
                {
                    "url":myUrl,
                    "HTML":html
                }
        )

When I run my script, I get the following error:

InvalidStringData: strings in documents must be valid UTF-8

I looked up the problem and figured out that I need to process the HTML somehow before saving it so that it is valid UTF-8, but I couldn't find out how.

I don't think my question is a duplicate of python encoding utf-8 since I do not see how that question is related to HTML. If I'm wrong, or my problem has nothing to do with HTML, please direct me.

  • possible duplicate of python encoding utf-8 Commented Jun 15, 2015 at 14:20

1 Answer

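To convert the byte string to unicode:

html.decode('utf8')

This decodes the UTF-8 bytes in your string into a unicode object. It raises UnicodeDecodeError if the page is not actually UTF-8 encoded, in which case you need to decode with the page's real encoding.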
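
If the page is not UTF-8, a plain decode('utf8') fails, which matches the UnicodeDecodeError reported in the comments. Here is a minimal sketch of one way around it, written for Python 3, where bytes and str are separate types (in Python 2 the same logic applies to str and unicode). The function name decode_html, its parameters, and the choice of cp1252 as a fallback are my own assumptions, not part of pymongo or urllib:

```python
# Sketch only: decode_html and its parameters are hypothetical names.
def decode_html(raw, declared_charset=None, fallbacks=("utf-8", "cp1252")):
    """Return the page as text, trying the declared charset first."""
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: replace undecodable bytes with U+FFFD instead of failing.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    page = b"<p>caf\xe9</p>"  # Not valid UTF-8, so the cp1252 fallback is used.
    print(decode_html(page))
```

The returned text is a proper unicode string, so it can be stored in the "HTML" field without tripping the UTF-8 check. In Python 3, the charset declared by the server is usually available from the response as c.headers.get_content_charset(), and it can be passed as declared_charset; it may be missing or wrong, which is why the fallbacks exist.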

5 Comments

Thanks for copying from the duplicate and by the way it's "encode" and not "decode" that is required here.
It says it's not a duplicate ... anyway, if you have a string s = 'asdasdc', then type(s) is str, s.encode('utf8') gives 'asdasdc', and s.decode('utf8') gives u'asdasdc', so the last one is unicode.
@user3561036 this is not a duplicate, dude; neither encode nor decode works for me. With both of them I get UnicodeDecodeError.
What Python version do you use? I tested on 2.6, dude.
By the way, you can try converting the data to binary. It will insert into the db, but I'm not sure it will allow text searches. See api.mongodb.org/python/current/api/bson/binary.html
