How to run SEACells efficiently on large-scale datasets
Hi,
First of all, thank you for developing this excellent package. I tried running SEACells on a large-scale dataset (~270K cells). While it performed well, it was too slow: model training over 50 iterations took almost 3 days and 3 hours.
I tried two approaches: using GPU and CPU.
- With GPU: I attempted to run SEACells on a GPU using the following command:
model = SEACells.core.SEACells(adata,
                               build_kernel_on=build_kernel_on,
                               n_SEACells=n_SEACells,
                               n_waypoint_eigs=n_waypoint_eigs,
                               convergence_epsilon=1e-5,
                               use_gpu=True)
However, I encountered the following error:
"OutOfMemoryError: Out of memory allocating 6,121,777,152 bytes (allocated so far: 32,323,490,304 bytes)."
We have 3 GPUs, each with 32,768 MiB of memory. I believed this would be sufficient, so I'm not sure why this error occurred.
Could you advise on how to resolve this issue? Additionally, is it possible to use more than one GPU for this process?
- With CPU: This works, but it takes far too long.
model = SEACells.core.SEACells(adata,
                               build_kernel_on='X_scVI',
                               n_SEACells=n_SEACells,
                               n_waypoint_eigs=n_waypoint_eigs,
                               convergence_epsilon=1e-5,
                               use_sparse=True)
Could you recommend solutions to improve the time and memory efficiency for running SEACells on large-scale datasets?
Thank you for your assistance.
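A quick check of the numbers in the out-of-memory error above may help explain it. The sketch below uses only the byte counts from the error message and the stated 32768 MiB per GPU. The dense-matrix estimate at the end is an assumption for illustration (a float64 cell-by-cell matrix for roughly 270K cells) and may not match what SEACells actually allocates.

```python
# Figures copied from the error message and the GPU description in the post.
requested = 6_121_777_152        # bytes the failed allocation asked for
allocated = 32_323_490_304       # bytes already allocated at that point
gpu_capacity = 32_768 * 1024**2  # one GPU: 32768 MiB expressed in bytes

needed = requested + allocated
shortfall = needed - gpu_capacity
print(f"needed:    {needed / 1024**3:.1f} GiB")
print(f"capacity:  {gpu_capacity / 1024**3:.1f} GiB (single GPU)")
print(f"shortfall: {shortfall / 1024**3:.1f} GiB")

# ASSUMPTION: a dense float64 matrix with one row and one column per cell.
# This is an illustrative upper bound, not a statement about SEACells internals.
n_cells = 270_000
dense_bytes = n_cells**2 * 8
print(f"dense {n_cells} x {n_cells} float64 matrix: {dense_bytes / 1024**3:.0f} GiB")
```

The amount already allocated is close to the capacity of one card, which suggests (but does not prove) that the run was using a single GPU rather than pooling memory across all three.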
I'm hoping someone has some input on this because I'm running into the same issue with a dataset of 240K cells. We're splitting it into smaller chunks, but it still takes up SO much memory and time. We want smaller metacells that match (at least as closely as possible) the ones we have already built manually, so I'm setting the number of SEACells to 1000+, but it's just so slow.
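For anyone trying the same chunking workaround, here is a minimal sketch of one way to keep metacell sizes comparable across chunks: give each chunk a number of SEACells proportional to its cell count. The helper name `allocate_seacells`, the chunk names, the chunk sizes, and the target of 200 cells per metacell are all hypothetical; only the idea of splitting a 240K-cell dataset comes from the post above.

```python
def allocate_seacells(chunk_sizes, cells_per_metacell):
    """Map each chunk name to a SEACell count so that every metacell
    covers roughly `cells_per_metacell` cells (hypothetical helper)."""
    if cells_per_metacell <= 0:
        raise ValueError("cells_per_metacell must be positive")
    return {
        name: max(1, round(n_cells / cells_per_metacell))
        for name, n_cells in chunk_sizes.items()
    }

# Hypothetical split of a 240K-cell dataset into three chunks.
chunk_sizes = {"chunk_a": 90_000, "chunk_b": 80_000, "chunk_c": 70_000}
budget = allocate_seacells(chunk_sizes, cells_per_metacell=200)

for name, n_seacells in budget.items():
    # Each chunk's AnnData subset would then be passed to
    # SEACells.core.SEACells(..., n_SEACells=n_seacells) as in the original post.
    print(name, n_seacells)
```

Because each chunk is fitted independently, metacells never span chunk boundaries, so it is probably best to split along a variable you do not want mixed anyway (for example, sample or batch).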
I saw that sparse-matrix support on GPU was planned or proposed at some point, but never implemented. Is there any current plan for that to happen? I would certainly love to see this tool become scalable.
I have the same problem! I need to run my 150K-cell dataset with a high number of SEACells and it is taking too long! This makes it unusable for many applications...
Same here! Multi-threading or parallel computation is needed.