
I am using Scrapy to scrape blogs and then store the data in MongoDB. At first I got an InvalidDocument exception, so the obvious conclusion was that the data was not in the right encoding. So before persisting the object, in my MongoPipeline I check whether the document is in 'utf-8 strict', and only then do I try to persist the object to MongoDB. BUT I still get InvalidDocument exceptions, which is annoying.

This is the code of my MongoPipeline object that persists items to MongoDB:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#

import pymongo
import sys, traceback
from scrapy.exceptions import DropItem
from crawler.items import BlogItem, CommentItem


class MongoPipeline(object):
    collection_name = 'master'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'posts')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):

        if type(item) is BlogItem:
            try:
                if 'url' in item:
                    item['url'] = item['url'].encode('utf-8', 'strict')
                if 'domain' in item:
                    item['domain'] = item['domain'].encode('utf-8', 'strict')
                if 'title' in item:
                    item['title'] = item['title'].encode('utf-8', 'strict')
                if 'date' in item:
                    item['date'] = item['date'].encode('utf-8', 'strict')
                if 'content' in item:
                    item['content'] = item['content'].encode('utf-8', 'strict')
                if 'author' in item:
                    item['author'] = item['author'].encode('utf-8', 'strict')

            except:  # catch *all* exceptions
                e = sys.exc_info()[0]
                spider.logger.critical("ERROR ENCODING %s", e)
                traceback.print_exc(file=sys.stdout)
                raise DropItem("Error encoding BLOG %s" % item['url'])

            if 'comments' in item:
                comments = item['comments']
                item['comments'] = []

                try:
                    for comment in comments:
                        if 'date' in comment:
                            comment['date'] = comment['date'].encode('utf-8', 'strict')
                        if 'author' in comment:
                            comment['author'] = comment['author'].encode('utf-8', 'strict')
                        if 'content' in comment:
                            comment['content'] = comment['content'].encode('utf-8', 'strict')

                        item['comments'].append(comment)

                except:  # catch *all* exceptions
                    e = sys.exc_info()[0]
                    spider.logger.critical("ERROR ENCODING COMMENT %s", e)
                    traceback.print_exc(file=sys.stdout)

        self.db[self.collection_name].insert(dict(item))

        return item

And still I get the following exception:

au coeur de l\u2019explosion de la bulle Internet n\u2019est probablement pas \xe9tranger au succ\xe8s qui a suivi. Mais franchement, c\u2019est un peu court comme argument !Ce que je sais dire, compte tenu de ce qui pr\xe9c\xe8de, c\u2019est quelles sont les conditions pour r\xe9ussir si l\u2019on est vraiment contraint de rester en France. Ce sont des sujets que je d\xe9velopperai dans un autre article.',
     'date': u'2012-06-27T23:21:25+00:00',
     'domain': 'reussir-sa-boite.fr',
     'title': u'Peut-on encore entreprendre en France ?\t\t\t ',
     'url': 'http://www.reussir-sa-boite.fr/peut-on-encore-entreprendre-en-france/'}
    Traceback (most recent call last):
      File "h:\program files\anaconda\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "H:\PDS\BNP\crawler\crawler\pipelines.py", line 76, in process_item
        self.db[self.collection_name].insert(dict(item))
      File "h:\program files\anaconda\lib\site-packages\pymongo\collection.py", line 409, in insert
        gen(), check_keys, self.uuid_subtype, client)
    InvalidDocument: Cannot encode object: {'author': 'Arnaud Lemasson',
     'content': 'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me co\xc3\xbbterait bien trop cher. Bref, 100% d\xe2\x80\x99accord avec vous. Le probl\xc3\xa8me, je ne vois pas comment cela pourrait changer avec le gouvernement actuel\xe2\x80\xa6 A moins que si, j\xe2\x80\x99ai pu lire il me semble qu\xe2\x80\x99ils avaient en t\xc3\xaate de r\xc3\xa9duire l\xe2\x80\x99IS pour les petites entreprises et de l\xe2\x80\x99augmenter pour les grandes\xe2\x80\xa6 A voir',
     'date': '2012-06-27T23:21:25+00:00'}
    2015-11-04 15:29:15 [scrapy] INFO: Closing spider (finished)
    2015-11-04 15:29:15 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 259,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 252396,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 11, 4, 14, 29, 15, 701000),
     'log_count/DEBUG': 2,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 11, 4, 14, 29, 13, 191000)}

Another funny thing: following the comment of @eLRuLL, I did the following:

>>> s = "Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me"
>>> s
'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me'
>>> se = s.encode("utf8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> se = s.encode("utf-8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)

So my question is: if this text cannot be encoded, then why is the try/except in my MongoPipeline not catching this exception? Only objects that don't raise an exception should be appended to item['comments'], right?

  • Have you tried first converting the item to a dict and then updating every field? Commented Nov 4, 2015 at 14:57
  • @eLRuLL As you suggested, I tried converting the item to a dict and then updating all the fields with the encoded 'utf-8 strict' values, but that too raises the same InvalidDocument exception. Commented Nov 4, 2015 at 15:47

6 Answers


Finally I figured it out. The problem was not with the encoding; it was with the structure of the documents.

I had started from the standard MongoPipeline example, which does not deal with nested Scrapy items.

What I am doing is: BlogItem: "url" ... comments = [CommentItem]

So my BlogItem has a list of CommentItems. Now, the problem comes here: to persist the object in the database, I do:

self.db[self.collection_name].insert(dict(item))

So here I am converting the BlogItem to a dict, but I am not converting the list of CommentItems. And because the traceback displays the CommentItem much like a dict, it did not occur to me that the problematic object was not a dict!

So finally, the way to fix this problem is to change the line that appends the comment to the comment list, like so:

item['comments'].append(dict(comment))

Now MongoDB considers it as a valid document.
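
For context, a minimal sketch of the fixed comment loop in process_item (same field names as in the question; the only change from the original is the dict() conversion on append):

if 'comments' in item:
    comments = item['comments']
    item['comments'] = []

    for comment in comments:
        # ... encode the comment fields as before ...
        # convert the nested CommentItem to a plain dict so that
        # pymongo can encode it as a BSON subdocument
        item['comments'].append(dict(comment))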

Lastly, for the last part, where I asked why I get an exception on the Python console and not in the script:

The reason is that on the Python console I was working with a plain byte string: calling .encode() on a byte string makes Python 2 first decode it with the default ASCII codec, and that implicit decode is what raises the UnicodeDecodeError.
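
To make that concrete, a minimal Python 2 sketch (the byte string is taken from the transcript above):

# Python 2
s = 'Tellement vrai\xe2\x80\xa6'   # str: UTF-8 encoded bytes, not unicode
# s.encode('utf-8') implicitly does s.decode(sys.getdefaultencoding()) first,
# and the default 'ascii' codec chokes on the byte 0xe2 at position 14
u = s.decode('utf-8')              # decode explicitly with the right codec
b = u.encode('utf-8')              # now encoding back to UTF-8 works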


1 Comment

A list in my yield has brought me here :b

I got this error when running a query

db.collection.find({'attr': {'$gte': 20}})

and some records in the collection had a non-numeric value for attr.
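
If you suspect the same problem, one way to locate the offending records is to query by BSON type (collection and field names here are placeholders; type code 2 means string, and MongoDB 3.2+ also accepts the alias 'string'):

import pymongo

db = pymongo.MongoClient()['test']   # assumes a local mongod

# find documents where 'attr' is stored as a string instead of a number
for doc in db.collection.find({'attr': {'$type': 2}}):
    print(doc['_id'], repr(doc['attr']))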



First, "somestring".encode(...) does not change "somestring"; it returns a new encoded string. So you should use something like:

 item['author'] = item['author'].encode('utf-8', 'strict')

and the same for the other fields.
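
A quick illustration of the point (Python 2; the string value is made up):

s = u'motiv\xe9'        # a unicode string
s.encode('utf-8')       # returns a NEW byte string; s itself is unchanged
s = s.encode('utf-8')   # rebind s to keep the encoded result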

6 Comments

The goal was to verify whether encoding was possible, i.e. whether the variables could be encoded to UTF-8; if one throws an exception, then I don't include that object. Plus, as MongoDB by default encodes its objects before persisting, I thought it would be useless to store these encoded objects. Nevertheless, I did as you suggested, but I still get the same error. I am updating the question.
By the way, when I try: s = 'Tellement vrai\xe2\x80\xa6 Il...'; s2 = s.encode('utf-8', 'strict') I get a UnicodeDecodeError
So that would mean that comment['content'] was not encoded, or that the obvious encode error that should have been raised was not raised.
spider.logger.critical("ERROR ENCODING %s", e) should be spider.logger.critical("ERROR ENCODING %s" % e); better, use import logging; logging.critical("error")
I just verified whether the code lines for the encoding are executed, and they are. So then, maybe MongoDB is not happy with 'utf-8 strict'? I find that to be unlikely.

I ran into the same error when using a NumPy array in a Mongo query:

'myField' : { '$in': myList },

The fix was simply to convert the np.ndarray into a list:

'myField' : { '$in': list(myList) },
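
A self-contained sketch of the fix (collection and field names are placeholders; assumes a local mongod). Note that ndarray.tolist() goes one step further than list() and also converts NumPy scalars to native Python types that BSON can encode:

import numpy as np
import pymongo

db = pymongo.MongoClient()['test']   # assumes a local MongoDB instance

my_list = np.array([1, 2, 3])

# np.ndarray is not BSON-encodable; a plain list of native ints is
cursor = db.collection.find({'myField': {'$in': my_list.tolist()}})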



In my case it was something silly, yet not easy to notice:

I accidentally wrote

f"indexes_access.{jsonData['index']}: {jsonData['newState']}"

instead of

{f"indexes_access.{jsonData['index']}": f"{jsonData['newState']}"}

(one long string built by the f-string instead of the key and value built separately)
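
A small sketch of the difference (the jsonData values are made up):

jsonData = {'index': 'idx1', 'newState': 'open'}   # made-up example values

bad = f"indexes_access.{jsonData['index']}: {jsonData['newState']}"
# -> the single string "indexes_access.idx1: open"; not a document,
#    so passing it where pymongo expects one raises the encoding error

good = {f"indexes_access.{jsonData['index']}": f"{jsonData['newState']}"}
# -> {'indexes_access.idx1': 'open'}; a dict that BSON can encode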



My issue was that occasionally I was passing integers to the query, and that was causing failures. Making sure to always pass floats fixed this, even when they were NumPy types.
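
For example (a sketch; attr is a placeholder field name), wrapping the value in float() yields a native Python float that BSON can encode, even when the input is a NumPy scalar:

import numpy as np

value = np.float32(19.5)                    # NumPy scalar; not BSON-encodable
query = {'attr': {'$gte': float(value)}}    # float() returns a native float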

