
nmf() and mclapply

sup27606 opened this issue · 2 comments

Dear developer, Thank you for this great package and fast NMF implementation.

I am reporting an issue that I am facing while running the nmf() function in conjunction with mclapply.

When I try to run nmf() within mclapply, all cores go to 100% indefinitely and the call never finishes. I wonder if there is a conflict between nmf()'s internal parallelization and mclapply's. I do call setRcppMLthreads(1) inside mclapply before running nmf().

This behavior appeared only after I updated the RcppML package. With a previous version (one where nmf() returned a class object rather than a list), there were no issues running it within mclapply.

I am running this on a single node of a Linux cluster (Intel Xeon CPU with 28 cores). Here is a sample:

```r
nmf_obj_list = mclapply(rep(10, 28), function(k) {
  setRcppMLthreads(1)
  temp = RcppML::nmf(nmf_input, k, maxit = 1e6, verbose = F)
  temp
}, mc.cores = 28)
```

nmf_input is a matrix of dimensions 9622 x 200. RcppML version is 0.5.6.

Thanks in advance for your advice.

sup27606 · Jan 07 '25

Interesting. RcppML uses OpenMP on the backend, and as far as I am aware setRcppMLthreads(n) controls how many threads OpenMP uses. RcppML also uses Eigen, which has its own BLAS and, on the C++ side, its own setNbThreads(n), but that respects the OpenMP setting.

Try setting the number of threads outside of the mclapply loop.
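A minimal sketch of that suggestion, assuming the setRcppMLthreads() interface discussed in this thread; the small random matrix stands in for nmf_input, and the hope is that forked workers inherit the parent's OpenMP setting:

```r
library(parallel)
library(RcppML)   # assumes RcppML 0.5.x as in the report

setRcppMLthreads(1)   # set once in the parent, before mclapply() forks

nmf_input <- matrix(runif(1000 * 50), 1000, 50)   # stand-in data

nmf_obj_list <- mclapply(rep(10, 28), function(k) {
  RcppML::nmf(nmf_input, k, maxit = 1e6, verbose = FALSE)
}, mc.cores = 28)
```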

Just as a general comment: you will almost certainly be better off using all 28 cores to parallelize each individual factorization, running the factorizations one after another, rather than attempting 28 factorizations in parallel. Thread utilization is much better that way.
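The sequential alternative sketched, under the same assumed API: one factorization at a time, each using all 28 OpenMP threads.

```r
library(RcppML)        # assumed API from this thread
setRcppMLthreads(28)   # devote all threads to one factorization at a time

nmf_input <- matrix(runif(1000 * 50), 1000, 50)   # stand-in data

nmf_obj_list <- lapply(rep(10, 28), function(k)
  RcppML::nmf(nmf_input, k, maxit = 1e6, verbose = FALSE)
)
```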

zdebruine · Jan 07 '25

Thank you for your response. I could not get mclapply to work, but I was ultimately able to run in parallel by switching to future_lapply with plan('multisession').

By the way, I have nearly 4000 NMF runs to perform (rank sweeps coupled with 200 runs per rank at different seeds), and each run finishes in a couple of minutes even on one core. So, for my situation at least, distributing the 4000 runs across the 28 cores seemed more efficient than running each NMF sequentially with all 28 cores.
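A sketch of that working route, assuming the future and future.apply packages are installed; plan(multisession) launches separate R processes rather than forks, which sidesteps fork/OpenMP interactions. The rank grid and per-run seeds here are hypothetical stand-ins for the 4000-run sweep:

```r
library(future.apply)            # assumes future + future.apply
plan(multisession, workers = 28)

nmf_input <- matrix(runif(1000 * 50), 1000, 50)   # stand-in data
ranks <- rep(5:24, each = 200)                    # hypothetical sweep: 20 ranks x 200 seeds

nmf_obj_list <- future_lapply(seq_along(ranks), function(i) {
  RcppML::setRcppMLthreads(1)                     # one OpenMP thread per worker
  RcppML::nmf(nmf_input, ranks[i], seed = i, verbose = FALSE)
}, future.seed = TRUE)
```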

sup27606 · Feb 04 '25

Aha, if you have to fit that many NMF models and you have the memory available on-node, then I would recommend your plan of action. Make sure setRcppMLthreads(1) has been called in each worker, so that each RcppML session uses only 1 thread (otherwise you may hit a nasty segfault).

zdebruine · Jun 06 '25