
lemur function forces conversion to dense matrix

edg1983 opened this issue 1 year ago • 8 comments

Hello!

Thanks for the interesting package!

I'm testing it on one of our datasets, which contains about 1M cells and 1.5k individuals. When calling the lemur function, I noticed that the tool converts the sparse count matrix into a dense array. This occupies a huge amount of memory and thus limits the possibility of applying the method to large-scale data.

Briefly, I have a SingleCellExperiment object that I generated from a dgCMatrix and accompanying data tables, like this:

library(SingleCellExperiment)

sce <- SingleCellExperiment(
    assays = list(logcounts = sparse_matrix),  # sparse_matrix is a dgCMatrix
    colData = cell_info,
    rowData = gene_info
)

Then I'm trying to run lemur as suggested in the tutorial:

fit <- lemur(sce, design = ~ phenotype + Exp, 
             n_embedding = 20, test_fraction = 0.5)

And I get this message:

Warning: sparse->dense coercion: allocating vector of size 172.0 GiB

The memory occupancy then quickly goes up to about 250 GB.
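
(That size is consistent with a dense double-precision matrix: assuming roughly 23,000 genes, $10^6 \times 23{,}000 \times 8$ bytes is about 172 GiB.)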

Am I doing something wrong? Is there a way to make the method work with a sparse matrix (the standard dgCMatrix used by SingleCellExperiment)?

Thanks!

edg1983 · Nov 27 '24 11:11

Hi Edoardo,

Thank you for reaching out. As far as I can see, you are not doing anything wrong; lemur currently needs to convert any input matrix to dense storage to do some of the underlying computations.

There are a few options to deal with this:

  • You can select a subset of genes. I typically work with a reasonable number of highly variable genes, like 1000 (see the sketch after this list).
  • You can subsample the cells, which allows you to fit the model and later project the full data onto the fit using the project_on_lemur_fit function.
  • I have started an experimental branch that aims to fully support fitting lemur on sparse matrices (https://github.com/const-ae/lemur/compare/devel...improve_memory_efficiency). This is work in progress, and I am a bit unsure how to proceed. At its core, lemur currently fits PCAs using irlba, which only works on dense matrices; the rsvd package can fit PCAs on sparse matrices, but the results can differ (https://github.com/erichson/rSVD/issues/11).
  • A fourth option would be to add support for file-backed matrix formats like HDF5. I haven't explored this in any depth yet, but it could be an attractive trade-off: slower run time in exchange for lower memory demands.
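
For the gene-subsetting option, here is a minimal sketch assuming the scran package for HVG selection (any other selection method works just as well); sce, phenotype, and Exp refer to your example above:

library(lemur)
library(scran)

# Model the per-gene mean-variance trend and keep the top 1000 HVGs
# (scran is just one option; any HVG selection method works)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n = 1000)

# Subset before calling lemur() so the dense coercion stays small
sce_small <- sce[hvgs, ]
fit <- lemur(sce_small, design = ~ phenotype + Exp, n_embedding = 20)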


const-ae · Nov 28 '24 11:11

Thanks for the suggestions! I'll try them and let you know.

So far, I've tried running lemur with test_fraction = 0.8 to reduce the amount of data that goes into the model-fitting procedure, but it runs out of memory even when reserving 520 GB (the maximum available on our system at the moment). Specifically, it dies at the step "Fit Grassmann linear model".

edg1983 · Nov 28 '24 11:11

Can you give me a short example of the steps to take if I want to try the project_on_lemur_fit approach?

edg1983 · Nov 28 '24 11:11

Can you give me a short example of the steps to take if I want to try the project_on_lemur_fit approach?

Here you go (from the man page of project_on_lemur_fit):

library(lemur)

data(glioblastoma_example_data)
subset1 <- glioblastoma_example_data[, 1:2500]     # cells used to fit the model
subset2 <- glioblastoma_example_data[, 2501:5000]  # held-out cells
fit <- lemur(subset1, design = ~ condition, n_emb = 5,
             test_fraction = 0, verbose = FALSE)

# Project the held-out cells onto the existing fit
fit2 <- project_on_lemur_fit(fit, subset2, return = "lemur_fit")
fit2

const-ae · Nov 28 '24 11:11

Great, thanks!

In the first fit, test_fraction must be zero because all cells are used only to fit the model, right? Then all cells in fit2 will be projected and used for DE testing. Correct?

Do you have any suggestions on a reasonable fraction of cells to use in the first fit?

edg1983 · Nov 28 '24 11:11

In the first fit, test_fraction must be zero because all cells are used only to fit the model, right? Then all cells in fit2 will be projected and used for DE testing. Correct?

I actually hadn't thought about the is_test_data label. It turns out that it is also set to FALSE in the projected data, but you can overwrite that using the int_colData function:

table(fit$is_test_data)
#> 
#> FALSE 
#>  2500
table(fit2$is_test_data)
#> 
#> TRUE 
#> 2500

int_colData(fit2)[["is_test_data"]] <- TRUE
fit2$test_data
#> class: lemur_fit 
#> dim: 300 2500 
#> metadata(9): n_embedding design ... use_assay row_mask
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468
#>   ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(2500): CTCATCTGGCAC TTAACCCGGCTT ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(2): linearFit embedding
#> mainExpName: NULL
#> altExpNames(0):

Do you have any suggestions on a reasonable fraction of cells to use in the first fit?

It depends on the complexity of the dataset. You need enough cells that lemur can identify the relevant cell type variation for each condition. An easy way to check this is to make a UMAP/tSNE of the fit_reduced$embedding and confirm that you see all the expected variation.
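
For example, a minimal sketch with the uwot package (an assumption on my side; fit_reduced is the fit on the subsampled cells, and phenotype stands for a condition column in your colData):

library(uwot)

# Cells are the columns of the embedding matrix, so transpose for UMAP
# (fit_reduced and phenotype are placeholders for your objects)
umap_coords <- umap(t(fit_reduced$embedding))
plot(umap_coords, pch = 16, cex = 0.3,
     col = factor(fit_reduced$phenotype),
     xlab = "UMAP 1", ylab = "UMAP 2")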

const-ae · Nov 28 '24 12:11

Thanks for the advice!

I've tried limiting the analysis to highly variable genes (5k in my case), and I can now complete the workflow using about 320 GB of memory.

I will now experiment also with the fit and project approach.

I have an additional question about the trait type we can test using lemur. I see that test_de requires setting up a contrast to compare 2 categories. Thus, I assume that only binary traits or traits with a small number of categories can be tested.

  1. Is it possible to also test for a quantitative trait (or are you working to support this in the future)?
  2. Do you have a suggestion for performing a comparison when there are more than 2 categories?

Thanks a lot!

edg1983 · Nov 29 '24 07:11

I've tried limiting the analysis to highly variable genes (5k in my case), and I can now complete the workflow using about 320 GB of memory.

That's great to hear :)

Thus, I assume that only binary traits or traits with a small number of categories can be tested.

  1. Is it possible to also test for a quantitative trait (or are you working to support this in the future)?

You can fit quantitative traits (e.g., in our preprint, we have an example using time post-fertilization in embryonic development). For the contrast, you could then pick two representative time points (for example, the first and last) and compare those. In a traditional linear model, contrasting two points is the same as testing whether the slope is different from zero. In LEMUR, this holds approximately.
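
As a hypothetical sketch, assuming a continuous covariate time in the design (2 and 24 are placeholders for your first and last time points):

# Compare the model's predictions at two representative time points
# (time, 2, and 24 are placeholders for your covariate and its range)
de_res <- test_de(fit, contrast = cond(time = 24) - cond(time = 2))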

  2. Do you have a suggestion for performing a comparison when there are more than 2 categories?

Do you mean something like an ANOVA where you check if any of the categories differ? I have thought about this, and it would be very cool to have, but it is a bit tricky to develop from a statistical perspective. The idea would be to implement a likelihood ratio test, which is pretty straightforward:

  1. Fit the full model and calculate the residual sum of squares (RSS),
  2. Fit a reduced model and calculate the RSS,
  3. Calculate the test statistic as the ratio of the two (adjusting for the difference in the degrees of freedom).

Lastly (and this is the tricky part), compare the test statistic against a null distribution. In traditional linear models, you can prove that the null distribution follows an F distribution that depends on the degrees of freedom of the full and reduced models.
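
For reference, in an ordinary linear model with $n$ observations, the statistic is

$$F = \frac{(\mathrm{RSS}_{\mathrm{reduced}} - \mathrm{RSS}_{\mathrm{full}}) / (p_{\mathrm{full}} - p_{\mathrm{reduced}})}{\mathrm{RSS}_{\mathrm{full}} / (n - p_{\mathrm{full}})},$$

which under the null follows an $F_{p_{\mathrm{full}} - p_{\mathrm{reduced}},\; n - p_{\mathrm{full}}}$ distribution, where $p_{\mathrm{full}}$ and $p_{\mathrm{reduced}}$ are the numbers of coefficients in the two models.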

However, for the LEMUR model, I don't know what shape the null distribution has. The complicating factors are that

  • I am not sure how to include the number of latent dimensions in the calculation
  • I don't know how to express the fact that I fit the models on the $C$ cells, but in reality only have $N$ samples where typically $N \ll C$.

(This could be a fun statistics paper that someone should write. So if you or anyone else reading this feels like taking a stab, contact me!)

The alternative to finding the shape of the null distribution analytically would be bootstrapping. This would work by shuffling the labels per sample, fitting LEMUR ~1000 times, and comparing the original RSS against the empirical RSS null distribution. The downsides of this approach are that you need enough replicates so that there is enough variation for the bootstrap, and that repeatedly fitting LEMUR is computationally expensive.
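
A rough sketch of that resampling loop; rss_of_fit is a hypothetical helper that refits LEMUR and returns the residual sum of squares, and sample_id and phenotype stand in for your sample and condition columns:

# Permute the condition labels at the sample level, refit, and collect
# the null distribution of the RSS (rss_of_fit is hypothetical)
sample_info <- unique(data.frame(sample = sce$sample_id,
                                 condition = sce$phenotype))
null_rss <- replicate(1000, {
  sce_perm <- sce
  perm <- sample(sample_info$condition)
  sce_perm$phenotype <- perm[match(sce_perm$sample_id, sample_info$sample)]
  rss_of_fit(lemur(sce_perm, design = ~ phenotype, n_embedding = 20))
})
# observed_rss: RSS of the fit on the unpermuted labels
p_value <- mean(null_rss <= observed_rss)  # empirical p-value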

const-ae · Nov 30 '24 11:11