Can I split a large gene matrix by rows and run Harmony in parallel?

Open jiajinlongkang opened this issue 3 years ago • 1 comments

Hi harmony group,

I have a large gene expression matrix with 31,734 genes by 147,185 cells. As the attached screenshot shows, running HarmonyMatrix() on the entire expression matrix fails with a "not enough resources" type of error. Can I split the expression matrix by rows (i.e. into blocks of genes) and run Harmony on each block separately? Would this give different results?

Thank you, Jack Kang

jiajinlongkang, Jun 14 '22 19:06

Hi Jack,

Thanks for the question! With such a large matrix, I would recommend two things:

(1) Subset to highly variable genes.
(2) Use a memory-efficient PCA package and then feed the PCA embeddings into HarmonyMatrix(..., do_pca=FALSE).
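As a rough illustration of the two steps above, here is a minimal sketch on simulated data. Everything not named in this thread is an assumption: the toy count matrix, the plain variance ranking used to pick variable genes (a stand-in for whatever HVG method you prefer), the choice of 500 genes and 20 PCs, the `batch` column, and the use of the irlba package as the memory-efficient PCA.

```r
# Sketch only: toy data and a deliberately simple HVG rule (assumptions, not from this thread).
library(Matrix)
library(irlba)
library(harmony)

set.seed(1)

# Toy stand-in for the real data: a sparse genes x cells count matrix.
n_genes <- 2000
n_cells <- 600
counts <- rsparsematrix(n_genes, n_cells, density = 0.05,
                        rand.x = function(n) rpois(n, lambda = 3) + 1)

# Hypothetical metadata with one batch variable, one row per cell.
meta_data <- data.frame(batch = rep(c("A", "B", "C"), length.out = n_cells))

# log1p maps zero to zero, so the matrix stays sparse.
logcounts <- log1p(counts)

# (1) Rank genes by variance without densifying: Var = E[x^2] - E[x]^2.
gene_mean <- Matrix::rowMeans(logcounts)
gene_var <- Matrix::rowMeans(logcounts^2) - gene_mean^2
n_hvg <- 500
hvg <- order(gene_var, decreasing = TRUE)[seq_len(n_hvg)]

# (2) Truncated PCA on the cells x genes matrix; only the top 20 PCs are computed.
pca <- prcomp_irlba(t(logcounts[hvg, ]), n = 20, center = TRUE, scale. = FALSE)
pca_embeddings <- pca$x  # cells x PCs

# Pass the embeddings to Harmony and skip its internal PCA.
# Depending on the installed harmony version, this call may emit a deprecation warning.
harmony_embeddings <- HarmonyMatrix(pca_embeddings, meta_data, "batch", do_pca = FALSE)
dim(harmony_embeddings)
```

With this approach Harmony only ever sees a cells-by-PCs matrix, so the memory cost of the correction step no longer depends on the number of genes.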

Hope that helps!

Ilya

ilyakorsunsky, Jun 16 '22 17:06

The newer version of the package should be able to handle this input, provided that the other steps used to compute the cell embeddings are memory-efficient.

pati-ni, Nov 30 '23 15:11