
One process does all jobs by itself

Open adriantre opened this issue 6 years ago • 3 comments

First off, I love the ease of use of this project!

I am trying to multiprocess the reading of an image (using rasterio). Each row in my dataframe will read a window from the image source. The image path is distributed to each process, which will then open it.

When I run parallel_apply(), all four processes seem to start, but only the first one continues. The other three stop at job 1, and the first one performs all jobs (152/38). The result is a dataframe concatenated from all processes, with four times as many rows as the input and NaN values in 3/4 of the rows. See screen cap below.

Do you have any input on why this is happening?

[Screenshot 2019-10-23 at 15.18.45]

adriantre, Oct 23 '19 13:10

Hello,

I'm not sure where your issue comes from.

  1. Could you please try with pandarallel v1.4.0?
  2. Could you also try the "classic" pandas way (if not already done) to be sure this issue comes from pandarallel?
  3. If not solved by 1., could you please send me the code you used to get this error?

nalepae, Nov 11 '19 19:11

My use case is kind of specific. Each thread should open a separate dataset reader and pass it to its own jobs (i.e. 4 dataset readers, one per thread).

I think that with pandarallel my only option is for each job to open its own dataset reader, which would be very I/O-intensive and slow. (I could instead send the same reader to all threads, but that is not allowed.) So pandarallel is sadly not a good fit for my case.

So I ended up implementing a method for this myself, based on this gist. This way I can run an intermediate (partially evaluated) function that opens a dataset reader for each thread/process.

So I have not made it work with pandarallel, but I guess the problem has to do with the threads or jobs blocking each other's dataset readers.

Edit: Here is a gist describing my solution. Maybe it can be further generalised to support other cases when there is a need for running an intermediate function per thread/process. It is not as user-friendly and beautifully implemented as pandarallel, so let me know if something is unclear!

adriantre, Nov 12 '19 13:11