Parallelize BB processing script

Open boomanaiden154 opened this issue 1 year ago • 2 comments

This patch parallelizes the BB processing script, which significantly speeds up the processing of basic blocks (BBs). Returns eventually diminish, especially on systems with a large number of threads, most likely because the workers block on I/O.
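For readers who want a concrete picture of the general shape of this kind of change, here is a minimal sketch in Python. It is not the code from this patch: `process_block` is a hypothetical stand-in for the per-BB work, and the hex-string input format and the `max_workers` parameter are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def process_block(block_hex: str) -> int:
    """Hypothetical per-BB work: decode a hex string and return its size in bytes."""
    return len(bytes.fromhex(block_hex))


def process_blocks_parallel(blocks, max_workers=4):
    """Run process_block over all blocks on a thread pool.

    pool.map yields results in input order, so the output list lines up
    with the input list regardless of which worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_block, blocks))


if __name__ == "__main__":
    print(process_blocks_parallel(["4889c8", "90"], max_workers=2))
```

In a sketch like this, raising `max_workers` only helps while the workers are not waiting on a shared resource such as disk I/O, which matches the diminishing returns described above. In Python specifically, CPU-bound work would also need processes rather than threads to scale.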

boomanaiden154, Apr 02 '24 03:04

I'm seeing near-linear speedup up to about 16 threads, after which I start to hit diminishing returns.

boomanaiden154, Apr 02 '24 03:04

Converting to a draft as it needs some more work. It currently does not handle the case where a partial batch of items is left over at the end.
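The leftover-batch case can be illustrated with a minimal sketch in Python. This is not the patch's code; the `batched` helper and its `batch_size` parameter are hypothetical names chosen for illustration. The point is that the final batch is emitted even when it holds fewer than `batch_size` items.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of up to batch_size items, including a shorter final batch.

    Because this is a generator, the batch_size check runs when iteration
    starts, not when batched() is called.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            # Input exhausted; any partial batch was already yielded.
            return
        yield batch


if __name__ == "__main__":
    for batch in batched(range(7), 3):
        print(batch)
```

With seven items and a batch size of three, the last batch contains the single remaining item instead of being dropped.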

boomanaiden154, Apr 02 '24 06:04

Closing this for now. I think I want to restructure this to do something different rather than just parallelizing in-process. I think adding Python bindings and using something like Apache Beam would enable more scalable dataset processing.

boomanaiden154, Jun 23 '24 21:06