I'm working on a web scraper that has two classes. One fetches the data and the other processes it. The first class ends up with a list of elements such as results = [1, 2, 3, 4, 5, ...]. The problem is that sometimes, due to a server-side error, the list comes out empty. How can I loop to restart the process until the list is not empty?

I kinda solved it like this, but I'm not sure whether it is efficient or good practice.

class DataScrapper:
    def __init__(self):
        ...

    def getData(self):
        self.results = []
        while not self.results:
            ...
        return self.results

Is this a pythonic way of solving the problem? Is there another more efficient way? Thank you very much.

  • This seems more like a business-logic question than a Python technical problem. Commented Jan 19, 2022 at 6:49
  • If ... is a server-side call, this may get you blacklisted for sending too many requests to the server. You may be better off accepting the empty list and returning it, or raising an exception. Commented Jan 19, 2022 at 6:50
  • This seems like a good and explicit approach. You might want to add a delay between attempts and a maximum number of tries, but this looks good. Commented Jan 19, 2022 at 6:52
  • Take a look at this short API guide from AWS (it suggests just what the commenters above me suggest): docs.aws.amazon.com/general/latest/gr/api-retries.html Commented Jan 19, 2022 at 6:55
  • Thank you everyone! I will keep these things in mind while refactoring the code. Thank you so much!! Commented Jan 19, 2022 at 16:39

1 Answer

Your idiom is simple and good for most cases.

However, keep two things in mind:

  1. You don't cap the retries. If the server is down for a long time, your script will get stuck.
  2. You keep generating requests even during downtime. That can put heavy load on both the client and the server. I strongly suggest using an exponential backoff strategy.
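
To make the second point concrete, here is a minimal sketch of what an exponential backoff schedule looks like. The helper name backoff_delays and the base, factor, and cap values are illustrative assumptions, not part of any library:

```python
def backoff_delays(max_tries, base=1.0, factor=2.0, cap=60.0):
    """Return the wait in seconds before each retry.

    Hypothetical helper: max_tries attempts means max_tries - 1 waits.
    Each wait is `factor` times the previous one and never exceeds `cap`.
    """
    return [min(cap, base * factor ** n) for n in range(max_tries - 1)]


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0]
```

The waits grow quickly, so a server that is down for a while sees only a handful of requests instead of a tight loop of them.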

A quick Google search turned up the backoff library, which lets you do both:

import backoff

# Retry while the return value is an empty list, at most 10 tries
@backoff.on_predicate(backoff.expo, lambda x: x == [], max_tries=10)
def getData(self):
    self.results = []
    ...
    return self.results

It checks the return value, and if it's an empty list, runs the function again with increasing delays until you reach 10 tries.
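
If you would rather not add a dependency, the same two ideas (a retry cap and growing delays) can be sketched by hand with the standard library. This is only a rough illustration: fetch_with_retries, EmptyResultError, and the parameter names are made up for the example, and fetch stands in for whatever the ... in your getData does. Unlike the decorator above, this version raises an exception once the tries run out, as one of the comments on the question suggested.

```python
import time


class EmptyResultError(Exception):
    """Raised when every attempt returned an empty list (hypothetical name)."""


def fetch_with_retries(fetch, max_tries=10, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty list or max_tries is reached.

    fetch is any zero-argument callable that returns a list.
    sleep is injectable so the waits can be skipped in tests.
    """
    for attempt in range(max_tries):
        results = fetch()
        if results:
            return results
        if attempt < max_tries - 1:
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise EmptyResultError(f"no data after {max_tries} tries")


# Usage with a stand-in fetcher that comes back empty twice, then succeeds.
responses = iter([[], [], [1, 2, 3]])
waits = []  # record the delays instead of actually sleeping
print(fetch_with_retries(lambda: next(responses), sleep=waits.append))  # [1, 2, 3]
print(waits)  # [1.0, 2.0]
```

In real use you would leave sleep at its default and pass your scraping step as fetch.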

1 Comment

Thank you very much for your time and suggestion! I'll try this :)