Performance of obs/exp pileups is significantly lower than that of obs pileups with nproc > 1
Hey! I've been running pileups, both observed and observed/expected, with cooltools.pileup. I noticed that, with my data, obs/exp pileups are generated ~20x slower than obs pileups when setting nproc=8: it took 40s to create a single obs/exp pileup and only 2s to create a single obs pileup. But with nproc=1 the speed is almost the same: both pileups take ~12s to run.
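For context, the timed calls look roughly like this. The cooler path, feature table, view, and expected table below are placeholders standing in for my data, and the exact cooltools.pileup signature may differ slightly between versions:

```python
import time
import cooler
import cooltools

# Placeholder inputs standing in for my data (not attached to this issue).
clr = cooler.Cooler("my_data.mcool::/resolutions/10000")
features_df = ...   # BED-like DataFrame of features to pile up on
view_df = ...       # genomic view (e.g. chromosome arms)
expected_df = ...   # cis expected table computed for the same view

# Observed pileup
t0 = time.perf_counter()
obs_stack = cooltools.pileup(
    clr, features_df, view_df=view_df, flank=100_000, nproc=8
)
print(f"obs: {time.perf_counter() - t0:.1f}s")      # ~2s for me

# Observed/expected pileup: identical call, plus expected_df
t0 = time.perf_counter()
obsexp_stack = cooltools.pileup(
    clr, features_df, view_df=view_df, expected_df=expected_df,
    flank=100_000, nproc=8
)
print(f"obs/exp: {time.perf_counter() - t0:.1f}s")   # ~40s for me
```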
I explored the issue, and it seems like the problem is with the multiprocessing provider: cooltools uses the multiprocess library internally, while for comparison I also tried multiprocessing and concurrent.futures from the stdlib. The benchmarking results for obs/exp pileup creation are below:
| Provider of the map functor | nproc | Time |
|---|---|---|
| stdlib (map) | 1 | 11.5s |
| concurrent.futures.ProcessPoolExecutor | 1 | 15.8s |
| concurrent.futures.ProcessPoolExecutor | 8 | 4.52s |
| multiprocess | 8 | 21s |
| multiprocessing | 8 | 3.31s |
Both multiprocessing and concurrent.futures outperform multiprocess with nproc=8. Also note that the stdlib map in single-core mode is roughly 2x faster than multiprocess.Pool.map with nproc=8 (11.5s vs 21s). I suspect the reason lies in how expected_df is serialized and/or passed to child processes by multiprocess.
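To make the comparison concrete, here is a self-contained sketch of the kind of benchmark I mean: a toy worker that, like the pileup workers, receives a large DataFrame together with every task. The worker, data size, and task count are illustrative stand-ins, not the actual cooltools internals:

```python
import time
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import multiprocess
import multiprocessing
import numpy as np
import pandas as pd

# Stand-in for expected_df: a large DataFrame shipped to workers with each task.
expected_df = pd.DataFrame(
    np.random.rand(1_000_000, 4), columns=["a", "b", "c", "d"]
)

def worker(task_id, expected_df):
    # Trivial placeholder; the real pileup workers extract snippets and
    # normalize them by the expected values here.
    return float(expected_df["a"].iloc[task_id % len(expected_df)])

def bench(name, map_functor, job, tasks):
    t0 = time.perf_counter()
    list(map_functor(job, tasks))
    print(f"{name}: {time.perf_counter() - t0:.2f}s")

if __name__ == "__main__":
    job = partial(worker, expected_df=expected_df)
    tasks = list(range(64))

    bench("stdlib map, nproc=1", map, job, tasks)
    with multiprocess.Pool(8) as pool:
        bench("multiprocess.Pool, nproc=8", pool.map, job, tasks)
    with multiprocessing.Pool(8) as pool:
        bench("multiprocessing.Pool, nproc=8", pool.map, job, tasks)
    with ProcessPoolExecutor(8) as executor:
        bench("ProcessPoolExecutor, nproc=8", executor.map, job, tasks)
```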
multiprocessing uses pickle for serialization, while multiprocess uses dill. I checked that the serialization speed and byte length of the serialized expected_df are very similar between dill and pickle. So the problem might be with how multiprocess implements Pool or Queue.
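The check itself is simple; the expected_df below is a synthetic stand-in of comparable size, with illustrative columns and dtypes:

```python
import pickle
import time

import dill
import numpy as np
import pandas as pd

# Stand-in for a cis expected table; shape, columns, and dtypes are illustrative.
expected_df = pd.DataFrame({
    "region1": "chr1",
    "region2": "chr1",
    "dist": np.arange(1_000_000),
    "balanced.avg": np.random.rand(1_000_000),
})

for name, dumps in [("pickle", pickle.dumps), ("dill", dill.dumps)]:
    t0 = time.perf_counter()
    payload = dumps(expected_df)
    elapsed = time.perf_counter() - t0
    print(f"{name}: {elapsed:.3f}s, {len(payload) / 1e6:.1f} MB")
```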
Anyway, it seems like changing the multiprocessing framework would benefit the performance of cooltools, especially when using the CLI. Within the API, users can in principle provide a custom map_functor (but it is currently broken, see #535). I would be glad to start a discussion on this matter.
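For reference, the API-level workaround I have in mind would look like this once map_functor behaves as intended. This is hypothetical usage (it currently fails because of #535), it reuses the placeholder variables from the first snippet, and I have not verified how map_functor is supposed to interact with nproc:

```python
import multiprocessing
import cooltools

# Hypothetical: drive cooltools.pileup with a stdlib multiprocessing pool
# instead of the built-in multiprocess-based one via the map_functor argument.
with multiprocessing.Pool(8) as pool:
    obsexp_stack = cooltools.pileup(
        clr, features_df, view_df=view_df, expected_df=expected_df,
        flank=100_000, map_functor=pool.map,
    )
```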
P.S. I don't know how reproducible the issue is in general, but it reproduces consistently with my data: 8 coolers, 2 different resolutions, and various numbers of features and nproc values.