Make Stetl multithreaded

Open fsteggink opened this issue 9 years ago • 3 comments

Stetl is an ideal candidate for multithreading. Most of the time it processes datasets that consist of multiple files, and it runs in a (server or desktop) environment where multiple processors or cores are available.

See also https://github.com/opengeogroep/NLExtract/issues/194

fsteggink avatar Jul 04 '16 14:07 fsteggink

From the Stetl Gitter conversation:

"Was at the PyAmsterdam virtual Meetup yesterday (June 24, 2020, JvdB). Very interesting presentation by Clayton Bezuidenhout, see YouTube from minute 16: https://youtu.be/Aqu5PE3tzV0?t=998 . In essence something Stetl-like (basic Pipeline architecture, driven by configuration), but every module is a Thread. Communication runs via Queues. I asked him whether he is willing to share code. Celery is a kind of alternative, but as far as I know that is multi-process with messaging etc., too heavy. In GeoHealthCheck I have good experience with scheduling (the APScheduler package) and multi-threading (each Healthcheck is a thread), very stable. I'm posting it here so I remember it..."

The framework is Open Source: https://bitbucket.org/clayton-bezuidenhout/threads-and-queues-example-app/src/master/

justb4 avatar Jul 22 '20 12:07 justb4

So the core architecture of Stetl is a Chain/Pipeline of Components (Inputs, Filters, Outputs) that pass Data Packets to each other. Likewise, a Component (or a group of linked Components) could run in its own Thread and pass Data Packets via Queues to other Component Threads. So instead of being connected directly, Components could be connected via Queues.
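To make the idea concrete, here is a minimal sketch in plain Python of Components running as Threads linked by Queues. It does not use Stetl's actual classes: `run_component`, `run_chain`, the `_END` sentinel, and the convention that a stage is a plain function taking and returning a packet are all assumptions made for illustration. Error handling is left out, so an exception inside a stage would stall the chain.

```python
import queue
import threading

_END = object()  # hypothetical sentinel marking the end of the packet stream


def run_component(process, inbox, outbox):
    """Take packets from inbox, process them, and pass results to outbox."""
    while True:
        packet = inbox.get()
        if packet is _END:
            outbox.put(_END)  # propagate shutdown to the next component
            break
        result = process(packet)
        if result is not None:  # a stage may drop a packet by returning None
            outbox.put(result)


def feed(packets, outbox):
    """Input side: push all packets into the first queue, then the sentinel."""
    for packet in packets:
        outbox.put(packet)
    outbox.put(_END)


def run_chain(packets, stages, maxsize=8):
    """Run every stage in its own thread, connected by bounded queues."""
    queues = [queue.Queue(maxsize=maxsize) for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=feed, args=(packets, queues[0]))]
    for i, stage in enumerate(stages):
        threads.append(
            threading.Thread(
                target=run_component, args=(stage, queues[i], queues[i + 1])
            )
        )
    for thread in threads:
        thread.start()

    # The calling thread acts as the final Output and drains the last queue.
    results = []
    while True:
        item = queues[-1].get()
        if item is _END:
            break
        results.append(item)

    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_chain(range(5), [lambda x: x + 1, lambda x: x * 2]))
```

The bounded queues give some back-pressure: a fast Input blocks when a slower Filter or Output falls behind, instead of buffering the whole dataset in memory. With one thread per stage and FIFO queues, packets also leave the chain in the order they entered.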

In other cases we may consider running multiple instances of a Chain, e.g. for the Dutch Key Registries (Basisregistraties), which typically come as multiple files whose processing order is not significant.
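A rough sketch of this second option, using the standard library's `concurrent.futures`. The function `run_chain_for_file` is a hypothetical stand-in for one complete Chain run on one input file; it is not a Stetl API, and the file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chain_for_file(path):
    """Hypothetical placeholder for running one full Chain on one file."""
    return f"processed {path}"


def run_all(paths, max_workers=4):
    """Run one Chain instance per file, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, even if the runs
        # themselves finish in a different order.
        return list(pool.map(run_chain_for_file, paths))


if __name__ == "__main__":
    print(run_all(["part_001.gml", "part_002.gml", "part_003.gml"]))
```

Because the files are independent, no queues between Components are needed here; each worker thread simply runs an unmodified Chain from start to finish.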

justb4 avatar Jul 22 '20 12:07 justb4

The best solution depends on the workflow. I would keep Stetl as 'atomic' as possible: just use it for a single task. IMO this means that it should be executed on a single machine, and in that case I agree that threads are much more efficient than processes. An example is loading the BGT into a database. This can be seen as a single job, which can be parallelized very well.

On the other hand, there are many situations where you want to run multiple Stetl jobs. In that case processes should be used, and if you want to perform the processing on multiple machines, Celery or a similar task queue for distributed processing is needed.

So, I would suggest focusing on options to make Stetl multithreaded when performing a single job.

fsteggink avatar Aug 05 '20 08:08 fsteggink