
I'm writing a script in Python that will scrape some pages from my web server and put them in a file. I'm using the mechanize module's Browser() class for this particular task.

However, I've found that fetching the pages one at a time through a single mechanize.Browser() instance is rather slow. Is there a way I could relatively painlessly use multithreading/multiprocessing (i.e. issue several GET requests at once)?

  • Have you looked at the Python threading module? Commented Oct 20, 2011 at 5:10
  • Isn't the threading module only for starting a new CPU thread? Commented Oct 20, 2011 at 5:12
  • Related: stackoverflow.com/questions/4119680/… and stackoverflow.com/questions/4139988/… and stackoverflow.com/questions/6905800/… Commented Oct 20, 2011 at 5:17
  • well, if you don't want to use threading as @ObscureRobot suggested, you can try multiprocessing. Commented Oct 20, 2011 at 5:30
  • ObscureRobot and imm: I don't want CPU threads. As my post says, I want "[to] issue several GET requests at once" - as in HTTP GET requests. @phaedrus - thanks, those are an interesting read. They don't seem very easy to implement, though; it looks like I'd have to rewrite the entire app (over 3000 lines of code). Commented Oct 20, 2011 at 5:53
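
For what it's worth, the threading module suggested in the comments does help here despite the GIL, because the threads spend almost all their time blocked on network I/O rather than on the CPU. A minimal, hypothetical sketch (the URLs are placeholders, and each thread gets its own Browser, since a mechanize.Browser instance is not thread-safe):

    import threading
    import mechanize

    # Placeholder list of pages on the server to scrape.
    urls = [
        "http://example.com/page1",
        "http://example.com/page2",
        "http://example.com/page3",
    ]

    results = {}
    results_lock = threading.Lock()

    def fetch(url):
        # One Browser per thread; the GIL is released while the socket
        # waits for the response, so the fetches overlap in time.
        browser = mechanize.Browser()
        browser.set_handle_robots(False)
        body = browser.open(url).read()
        with results_lock:
            results[url] = body

    threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    for url, body in results.items():
        print("%s: %d bytes" % (url, len(body)))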

2 Answers


Use gevent or eventlet to get concurrent network IO.
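
Not the answerer's code, but a minimal sketch of the gevent approach, assuming the fetches stay in mechanize (URLs are placeholders): monkey-patching makes the blocking socket calls cooperative, so several greenlets can have GET requests in flight at once with little change to existing code.

    # gevent's monkey patching must run before anything else imports sockets.
    from gevent import monkey
    monkey.patch_all()

    import gevent
    import mechanize

    # Placeholder URLs.
    urls = [
        "http://example.com/page1",
        "http://example.com/page2",
        "http://example.com/page3",
    ]

    def fetch(url):
        # One Browser per greenlet; the patched sockets yield to the gevent
        # hub while waiting on the response, so downloads run concurrently.
        browser = mechanize.Browser()
        browser.set_handle_robots(False)
        return url, browser.open(url).read()

    jobs = [gevent.spawn(fetch, url) for url in urls]
    gevent.joinall(jobs)

    for job in jobs:
        url, body = job.value
        print("%s: %d bytes" % (url, len(body)))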


If you want industrial-strength Python web scraping, check out scrapy. It uses Twisted for async comms and is blindingly fast. Being able to spider through 50 pages per second isn't an unrealistic expectation.
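
A minimal, hypothetical spider to illustrate (the URLs and file-naming scheme are placeholders, not part of the answer); Scrapy schedules the requests through Twisted, so the pages are downloaded concurrently:

    import scrapy

    class PagesSpider(scrapy.Spider):
        name = "pages"
        # Placeholder starting URLs on the server being scraped.
        start_urls = [
            "http://example.com/page1",
            "http://example.com/page2",
        ]

        def parse(self, response):
            # Save each downloaded page to a file named after the URL path.
            filename = response.url.rstrip("/").split("/")[-1] or "index"
            with open(filename + ".html", "wb") as f:
                f.write(response.body)

It can be run without setting up a full project via: scrapy runspider pages_spider.py (the file name here is arbitrary).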
