openproblems-v2

[bat_int] cell cycle genes not in adata for cellxgene_census dataset

Open KaiWaldrant opened this issue 2 years ago • 3 comments

Describe the bug: The cell_cycle_conservation metric fails on datasets from the Cell x Gene Census:

ValueError: cell cycle genes not in adata
 organism: human
 varnames: ['ENSG00000105792', 'ENSG00000128253', 'ENSG00000015413', 'ENSG00000164402', 'ENSG00000246375', 'ENSG00000176402', 'ENSG00000022976', 'ENSG00000123191', 'ENSG00000198283', 'ENSG00000092020']

To Reproduce: https://tower.nf/orgs/openproblems-bio/workspaces/openproblems-bio/watch/ma2LsRoQarR8Z

KaiWaldrant, Jan 08 '24 10:01

Hi, this issue arises because the cell cycle genes used by the metric are available as gene names, not Ensembl IDs, whereas CxG uses Ensembl IDs as the var names. I would suggest overwriting var_names with adata.var["feature_name"] during processing, if that column exists. Does that sound reasonable?

mumichae, Jan 10 '24 13:01

I tend to prefer setting the var names to Ensembl IDs rather than gene names, because otherwise there are duplicate var names. WDYT?

rcannood, Jan 10 '24 13:01

In general that makes sense, but for the cell cycle metric we would still need gene symbols. Would you prefer to rename the var_names only for the metric instead?

mumichae, Jan 10 '24 13:01
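
For reference, a minimal sketch of the metric-only renaming discussed above, not the project's actual implementation. The helper name metric_var_names and the toy IDs are hypothetical. It assumes missing symbols are given as None or an empty string. It uses a gene symbol only where it is present and unique, and keeps the Ensembl ID otherwise, so the resulting names do not introduce duplicates.

```python
from collections import Counter


def metric_var_names(ensembl_ids, feature_names):
    """Hypothetical helper: pick one name per gene for use inside the metric.

    Uses the gene symbol when it is present and occurs exactly once;
    otherwise falls back to the gene's Ensembl ID.
    Assumes missing symbols are None or "" (not NaN).
    """
    counts = Counter(name for name in feature_names if name)
    return [
        name if name and counts[name] == 1 else ens_id
        for ens_id, name in zip(ensembl_ids, feature_names)
    ]


# Toy input, made up for illustration (not taken from the failing dataset).
ensembl_ids = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]
feature_names = ["GENE1", "GENE2", "GENE2", None]

names = metric_var_names(ensembl_ids, feature_names)
print(names)  # ['GENE1', 'ENSG_B', 'ENSG_C', 'ENSG_D']
```

Inside the metric, the result could presumably be assigned to a copy of the object (for example via adata.copy() and its var_names attribute, built from adata.var["feature_name"]), leaving the Ensembl IDs in the shared dataset untouched. Whether that fits the pipeline is for the maintainers to judge.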