
I'm having trouble working out how to design a Python worker + API setup that processes data gathered from the Internet and lets other (external) services access that data through an API. I should say up front that I don't have a formal background in Computer Science, just a lot of trial and error that got me this far.

The Question

How can I avoid finding my API busy because the worker is fetching data? Threading and queues seem to be the solution, but I'm having trouble adapting my project to them. Could someone suggest which approach should be used in this case, and point me to projects that might be similar to this one?

I've already asked a question about this on Stack Overflow without getting any answer; there you can find the code (my first question + code).

This problem can also be framed at a bigger scale in this question (multiple workers + Flask APIs).

This is my situation, more or less: [script structure diagram]

References

I've also checked these out:

  • Perhaps you can look up Flask-Celery Commented Jan 23, 2018 at 15:12
  • I've just checked how Celery works; is it also capable of helping me handle when I can or can't access my global var? Commented Jan 23, 2018 at 15:23
  • I'm not sure I understand the global var, but serializing an object across tasks (for example, using pickle) should be possible Commented Jan 23, 2018 at 16:29
  • By "global var" I mean a var that is shared between both processes; it wasn't clear, I'm sorry. Pickle sounds interesting, but I don't actually understand if it's suited for this case. Commented Jan 23, 2018 at 17:06
  • Pickle is a native Python serialization format. It can be shared between processes, or over a network socket. If you want a more robust persistence mechanism, then you can try a database. For example, this uses Redis: allynh.com/blog/… (a small illustration follows these comments) Commented Jan 23, 2018 at 20:38
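
As a rough illustration of the pickle suggestion above (the dictionary is made-up sample data), a round trip looks like this:

    import pickle

    data = {"source": "example.com", "items": [1, 2, 3]}
    blob = pickle.dumps(data)            # bytes; can cross processes or sockets
    assert pickle.loads(blob) == data    # restores the original object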

1 Answer


Use the threading library. Keep the main thread open for handling responses, and spin off 'job' threads that are joined (via Thread.join()) to each other to form a queue.
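
A minimal sketch of the idea, using the standard threading and queue modules (process() is a hypothetical stand-in for the real work):

    import threading
    import queue

    jobs = queue.Queue()

    def process(job):
        ...  # stand-in for the real fetch/compute work

    def worker():
        while True:
            job = jobs.get()      # blocks until a job arrives
            try:
                process(job)
            finally:
                jobs.task_done()

    # daemon thread: dies with the main process, which stays free for the API
    threading.Thread(target=worker, daemon=True).start()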

You'll need to provide the API user with a job id (best to persist these, and perhaps progress and status info, outside the app in a database), and then let them query their job's status or download the result from another endpoint. You could keep another queue of threads handling anything compute-intensive related to collecting/downloading.
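
Building on the queue sketch above, a hypothetical Flask app (Flask comes from the question's own setup) could hand out job ids like this; the in-memory status dict stands in for the database recommended here:

    import uuid
    from flask import Flask, jsonify

    app = Flask(__name__)
    status = {}   # illustration only; persist job state in a database

    @app.route("/jobs", methods=["POST"])
    def submit_job():
        job_id = str(uuid.uuid4())
        status[job_id] = "pending"
        jobs.put(job_id)          # hand off to the worker queue above
        return jsonify({"job_id": job_id}), 202

    @app.route("/jobs/<job_id>")
    def job_status(job_id):
        return jsonify({"status": status.get(job_id, "unknown")})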

All that said, this can also be accomplished using a microservice architecture in which you have one app scheduling jobs, one app retrieving/processing data, and one app handling status/download requests. These would be joined via HTTP interfaces (RESTful would be great) and a database for common persistence of data.
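
For instance (service names, ports, and payload are made up), the scheduling app could hand a job to the collector app over plain HTTP with the requests library:

    import requests

    # the scheduler submits a job to the collector service and keeps the id
    resp = requests.post("http://collector:8000/jobs",
                         json={"url": "http://example.com"})
    job_id = resp.json()["job_id"]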

The benefit of this last approach is that each app is independently scalable, from an availability and resources perspective, within a framework like Kubernetes.

UPDATE:

Just read your original post, and your main issue seems to be persisting your data in a global variable rather than a database. Keep your data in a database, and provide it to clients either through a separate application, or through a set of threads set aside for that purpose in your current app.
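
One possible shape for that (sqlite3 here only because it ships with Python; any database works): a save/load pair replaces the global variable, and opening a connection per call keeps it safe across threads:

    import sqlite3

    DB = "results.db"

    def init():
        with sqlite3.connect(DB) as conn:  # 'with' commits the transaction
            conn.execute("CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value TEXT)")

    def save(key, value):
        with sqlite3.connect(DB) as conn:
            conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (key, value))

    def load(key):
        with sqlite3.connect(DB) as conn:
            row = conn.execute("SELECT value FROM results WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None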

UPDATE (in response to OP's comment):

Stefano, in the use case you're describing, there is no need for any of the components to be connected to each other. They all only need to be connected to the database.

The data collection service should collect the data, and then submit it to the database for storage, where the "request data" component can find and retrieve it.

If there is a need for user input to this process, then the "submit request for data" component should accept that request, provide the user with an id, and then store that job's requirements in the database for the data collector component to discover. You would then need one more component for serving the job's status/progress from the database to the user.
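
A sketch of the collector's side of that hand-off (table and column names are invented), polling the database for queued jobs:

    import sqlite3
    import time

    def collect(params):
        ...  # stand-in for the real data collection

    def poll_jobs():
        while True:
            with sqlite3.connect("jobs.db") as conn:
                row = conn.execute(
                    "SELECT id, params FROM jobs WHERE status = 'pending' LIMIT 1"
                ).fetchone()
                if row:
                    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
            if row:
                collect(row[1])
                with sqlite3.connect("jobs.db") as conn:
                    conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (row[0],))
            else:
                time.sleep(1)  # back off when there is nothing to do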

What DB are you using? If it's slow or busy, you can scale the resources available to it (RAM), or you can look at batching the updates from the data collector, which is the most likely culprit of unnecessary DB overhead. How many transactions are you submitting per second, and of what size?
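
Batching here just means buffering rows and writing them in one transaction instead of one INSERT per item; a sketch (table name invented, sqlite3 again only for illustration):

    import sqlite3

    buffer = []

    def record(row, batch_size=100):
        buffer.append(row)
        if len(buffer) >= batch_size:
            flush()

    def flush():
        # one transaction for the whole batch instead of one per row
        with sqlite3.connect("results.db") as conn:
            conn.executemany("INSERT INTO results VALUES (?, ?)", buffer)
        buffer.clear()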

Also, if you're Italian, you can ask me in your own language if that makes it easier to communicate these technical details.


1 Comment

Thank you for your answer, I really appreciate it. I've been working on a similar project for a while and have tried many approaches and tests. I also tried a DB (it was my first attempt), but I had to move away from it because I had trouble accessing the DB when it was busy serving other processes. Now I've divided the tasks over multiple mini services (as you suggested), and it works. But the central service, which collects all the information, has to deal on one side with accessing each mini service, and on the other with a mini service that has a web interface, and sometimes it's busy. :(
