Async, multithreaded scraping in Python with limited threads

Question

We have to refactor scraping algorithm. To speed it up we came up to conclusion to multi-thread processes (and limit them to max 3). Generally speaking scraping consists of following aspects:

Scraping (async request, takes approx 2 sec)
Image processing (async per image, approx 500ms per image)
Changing source item in DB (async request, approx 2 sec)

What I am aiming to do is to create batch of scraping requests and while looping through them, create a stack of consequent async operations: Process images and as soon as images are processed -> change source item.

In other words - scraping goes. but image processing and changing source items must be run in separate limited async threads.

Only think I don't know how to stack the batch and limit threads.

Has anyone came across the same task and what approach have you used?

rezna · Accepted Answer · 2019-03-13 09:33:42Z

1

What you're looking for is consumer-producer pattern. Just create 3 different queues and when you process the item in one of them, queue new work in another. Then you can 3 different threads each of them processing one queue.

answered Mar 13, 2019 at 9:33

rezna

2,2432 gold badges17 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Timothy Over a year ago

Your answer lead me to this bogotobogo.com/python/Multithread/… great stuff...

Collectives™ on Stack Overflow

Async, multithreaded scraping in Python with limited threads

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related