[bat_int] cell cycle genes not in adata for cellxgene_census dataset
Describe the bug
The cell_cycle_conservation metric fails on datasets from the Cell x Gene Census:
ValueError: cell cycle genes not in adata
organism: human
varnames: ['ENSG00000105792', 'ENSG00000128253', 'ENSG00000015413', 'ENSG00000164402', 'ENSG00000246375', 'ENSG00000176402', 'ENSG00000022976', 'ENSG00000123191', 'ENSG00000198283', 'ENSG00000092020']
To Reproduce https://tower.nf/orgs/openproblems-bio/workspaces/openproblems-bio/watch/ma2LsRoQarR8Z
Hi, this issue arises because the cell cycle gene list is provided as gene names (symbols), not Ensembl IDs, whereas CxG uses Ensembl IDs as the var names. I would suggest overwriting the var_names with adata.var["feature_name"] during processing, if that column exists. Does that sound reasonable?
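A minimal sketch of that suggestion, assuming gene symbols live in adata.var["feature_name"]. The helper name use_feature_names and the backup column "ensembl_id" are hypothetical and not part of the existing pipeline:

```python
def use_feature_names(adata, column="feature_name", backup="ensembl_id"):
    """Overwrite var_names with gene symbols when `column` exists.

    Hypothetical helper: `adata` is assumed to be AnnData-like, i.e. it
    exposes `var` (a table supporting `in` and item assignment) and
    `var_names`. Returns True if the names were replaced.
    """
    if column not in adata.var:
        # No symbol column available: leave the object untouched.
        return False
    # Keep the original Ensembl IDs in a column so they are not lost
    # (the backup column name is an assumption).
    adata.var[backup] = list(adata.var_names)
    adata.var_names = list(adata.var[column])
    return True
```

Note that this does nothing about duplicated symbols; those would still need to be handled separately.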
I tend to prefer setting the var names to Ensembl IDs rather than gene names, because otherwise there are duplicate var names. WDYT?
In general that makes sense, but for the cell cycle metric we would still need gene symbols. Would you prefer to rename the var_names only for the metric instead?
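One way to reconcile both preferences might be to keep Ensembl IDs as the global var names and derive symbol-based names only inside the metric. The sketch below shows just the name-mapping step in plain Python. symbols_with_fallback is a hypothetical helper, and falling back to the Ensembl ID for missing or duplicated symbols is an assumption about how to deal with the duplicates mentioned above:

```python
from collections import Counter


def symbols_with_fallback(ensembl_ids, symbols):
    """Return one name per gene, preferring the gene symbol.

    A gene keeps its symbol only if the symbol is a non-empty string that
    occurs exactly once; otherwise its Ensembl ID is used instead.
    Assumes Ensembl IDs are unique and never appear as another gene's symbol.
    """
    if len(ensembl_ids) != len(symbols):
        raise ValueError("ensembl_ids and symbols must have the same length")
    counts = Counter(symbols)
    return [
        sym if sym and counts[sym] == 1 else ens
        for ens, sym in zip(ensembl_ids, symbols)
    ]
```

Inside the metric this could be applied to a copy, for example tmp = adata.copy() followed by tmp.var_names = symbols_with_fallback(list(adata.var_names), list(adata.var["feature_name"])), so that the caller's Ensembl-based var names stay unchanged.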