batchglm

Stop using multiprocessing for fitting dispersion models earlier

Open le-ander opened this issue 3 years ago • 1 comment

From experience, the multiprocessing overhead when fitting dispersion models seems to be a lot larger than the current code accounts for.

I.e., fitting the last 50 or so models still takes a long time, and as soon as multiprocessing is switched off for the last models, things become a lot faster. Maybe multiprocessing could be used only when there are more than 10x as many genes left as processors, here: https://github.com/theislab/batchglm/blob/31b905b99b6baa7c94b82550d6a74f00d81966ea/batchglm/train/numpy/base_glm/estimator.py#L463

So something like: if nproc > 1 and len(idx_update) > 10 * nproc:
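A minimal sketch of that check as a standalone helper, to make the cutoff explicit. The function name `use_multiprocessing` and the `min_jobs_per_proc` parameter are hypothetical and not part of batchglm; only `nproc` and the length of `idx_update` (passed here as `n_remaining`) come from the condition above.

```python
def use_multiprocessing(n_remaining: int, nproc: int, min_jobs_per_proc: int = 10) -> bool:
    """Return True only if enough models are left to amortize worker overhead.

    n_remaining stands in for len(idx_update); min_jobs_per_proc is the
    proposed factor of 10 (an assumption, not a tuned value).
    """
    return nproc > 1 and n_remaining > min_jobs_per_proc * nproc


# Hypothetical example: 8 worker processes, varying numbers of genes left.
nproc = 8
decisions = {n: use_multiprocessing(n, nproc) for n in (1000, 81, 80, 10)}
print(decisions)
```

With 8 processes the cutoff sits at 80 remaining genes, so the last iterations would all run without multiprocessing.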

le-ander, Mar 30 '22 08:03

To provide some numbers: on 8 cores, the last iteration where multiprocessing is used (fitting about 9 or 10 genes) takes 16 s; the next iteration (no multiprocessing, so 7-8 genes) takes 2 s.
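As a rough per-gene comparison of those two iterations, here is a small calculation. The gene counts are assumed midpoints of the approximate ranges given (9.5 and 7.5), so the result is only an order-of-magnitude estimate.

```python
# Timings reported for the two iterations (8 cores), in seconds.
t_mp, t_serial = 16.0, 2.0

# Assumed midpoints of the approximate gene counts ("9 or 10" and "7-8").
genes_mp, genes_serial = 9.5, 7.5

per_gene_mp = t_mp / genes_mp              # seconds per gene with multiprocessing
per_gene_serial = t_serial / genes_serial  # seconds per gene without it
slowdown = per_gene_mp / per_gene_serial

print(f"{per_gene_mp:.2f} s/gene vs {per_gene_serial:.2f} s/gene ({slowdown:.1f}x)")
```

Under these assumptions, each gene costs roughly six times more wall-clock time in the multiprocessing iteration than in the serial one.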

le-ander, Mar 30 '22 08:03