I am trying to run Scrapy from a Python script. I'm looking at this code (source):

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

My issue is that I'm not sure how to adjust this code to run my own spider. I have named my spider "spider_a", and it specifies the domain to crawl within the spider itself.

What I am asking is this: if I run my spider with the following command:

scrapy crawl spider_a

How do I adjust the example Python code above to do the same?

2 Answers

Just import your spider class and pass an instance of it to crawler.crawl(), like:

from testspiders.spiders.spider_a import MySpider

spider = MySpider()
crawler.crawl(spider)

1 Comment

Running it this way ignores the user's project settings.

In Scrapy 0.19.x (it may also work with older versions) you can do the following.

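from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()  # load the project's settings instead of bare defaults
crawler = Crawler(settings)
# stop the reactor once the spider closes, so the script can exit
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal is sent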

You can even call the command directly from a script like:

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  #followall is the spider's name
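One caveat with the snippet above: as far as I know, cmdline.execute() finishes by calling sys.exit(), so nothing placed after it in the same script will run. If you need the script to continue afterwards, a rough alternative is to launch the same command in a child process. This is only a sketch: the helper names build_crawl_command and run_crawl and the project_dir parameter are made up for illustration, and it assumes the scrapy executable is on your PATH and that project_dir is the directory containing scrapy.cfg.

```python
import subprocess


def build_crawl_command(spider_name, extra_args=()):
    """Return the argument list equivalent to `scrapy crawl <spider_name>`."""
    # Assumes the `scrapy` executable is available on PATH.
    return ["scrapy", "crawl", spider_name, *extra_args]


def run_crawl(spider_name, project_dir=".", extra_args=()):
    """Run the crawl in a child process and return its exit code."""
    # project_dir is assumed to be the folder that holds scrapy.cfg,
    # so that the child process picks up the project's settings.
    completed = subprocess.run(
        build_crawl_command(spider_name, extra_args),
        cwd=project_dir,
    )
    return completed.returncode
```

Calling run_crawl("spider_a") would then behave like typing scrapy crawl spider_a in the project directory, and your script keeps running once the crawl has finished.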

Take a look at my answer here. I changed the official documentation so that your crawler now uses your project settings and can produce output.
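For readers on newer Scrapy releases (1.0 and later, not the 0.19.x this answer was written for), the manual Crawler and reactor wiring is usually replaced by CrawlerProcess, which starts and stops the reactor for you. The following is a minimal sketch under that assumption; run_spider is a hypothetical helper name, and it assumes the script is run from inside the project so that get_project_settings() can find scrapy.cfg and a spider named "spider_a".

```python
def run_spider(spider_name="spider_a"):
    """Run one spider by name with the project's settings; blocks until done."""
    # Imported inside the function so that defining this helper
    # does not itself require Scrapy to be importable.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Using the project settings means pipelines, middlewares, etc. are honored.
    process = CrawlerProcess(get_project_settings())
    # A spider name (string) is looked up in the project's SPIDER_MODULES.
    process.crawl(spider_name)
    # Starts the Twisted reactor and blocks until the crawl has finished.
    process.start()
```

Calling run_spider() is then roughly the scripted equivalent of scrapy crawl spider_a. Note that the Twisted reactor cannot be restarted, so this helper can only be called once per Python process.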
