KeyError: 'Gene' in get_search_space
Describe the bug
Hello. I seem to be getting an error at the get_search_space step in the scenicplus pipeline with snakemake. Any immediate thoughts about why this could be? Should my region sets have genes in them?
Error output [Wed May 1 21:44:02 2024]
localrule get_search_space:
input: ACC_GEX.h5mu, genome_annotation.tsv, chromsizes.tsv
output: search_space.tsv
jobid: 11
reason: Missing output files: search_space.tsv; Input files updated by another job: chromsizes.tsv, genome_annotation.tsv
resources: tmpdir=/tmp
2024-05-01 21:44:15,970 SCENIC+ INFO Reading data
/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
warnings.warn(
/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
warnings.warn(
Traceback (most recent call last):
File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3800, in get_loc
return self._engine.get_loc(casted_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Gene'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/users/ostaplet/miniconda3/envs/scenic_devel/bin/scenicplus", line 8, in <module>
sys.exit(main())
^^^^^^
File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 1137, in main
args.func(args)
File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 208, in search_space
get_search_space_command(
File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/scenicplus/cli/commands.py", line 661, in get_search_space_command
search_space = get_search_space(
^^^^^^^^^^^^^^^^^
File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/scenicplus/data_wrangling/gene_search_space.py", line 297, in get_search_space
if pr_gene_annotation.df['Gene'].isnull().to_numpy().any():
~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/pandas/core/frame.py", line 3805, in __getitem__
indexer = self.columns.get_loc(key)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
raise KeyError(key) from err
KeyError: 'Gene'
[Wed May 1 21:49:28 2024]
Error in rule get_search_space:
jobid: 11
input: ACC_GEX.h5mu, genome_annotation.tsv, chromsizes.tsv
output: search_space.tsv
shell:
scenicplus prepare_data search_spance --multiome_mudata_fname ACC_GEX.h5mu --gene_annotation_fname genome_annotation.tsv --chromsizes_fname chromsizes.tsv --out_fname search_space.tsv --upstream 1000 150000 --downstream 1000 150000 --extend_tss 10 10
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-01T214254.576799.snakemake.log
WorkflowError:
At least one job did not complete successfully.
**Expected behavior**
A clear and concise description of what you expected to happen.
Datasnapshot This is my config file: input_data: cisTopic_obj_fname: "scATAC/cistopic_obj_withUMAP_string.pkl" GEX_anndata_fname: "scRNA/rna_batch3_coussens_updatedMeta.h5ad" region_set_folder: "scATAC/region_sets_all/" ctx_db_fname: "scATAC/mm10_screen_v10_clust.regions_vs_motifs.rankings.feather" dem_db_fname: "scATAC/mm10_screen_v10_clust.regions_vs_motifs.scores.feather" path_to_motif_annotations: "scATAC/motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl"
output_data:
output for prepare_GEX_ACC .h5mu
combined_GEX_ACC_mudata: "ACC_GEX.h5mu"
output for motif enrichment results .hdf5
dem_result_fname: "dem_results.hdf5" ctx_result_fname: "ctx_results.hdf5"
output html for motif enrichment results .html
output_fname_dem_html: "dem_results.html" output_fname_ctx_html: "ctx_results.html"
output for prepare_menr .h5ad
cistromes_direct: "cistromes_direct.h5ad" cistromes_extended: "cistromes_extended.h5ad"
output tf names .txt
tf_names: "tf_names.txt"
output for download_genome_annotations .tsv
genome_annotation: "genome_annotation.tsv" chromsizes: "chromsizes.tsv"
output for search_space .tsb
search_space: "search_space.tsv"
output tf_to_gene .tsv
tf_to_gene_adjacencies: "tf_to_gene_adj.tsv"
output region_to_gene .tsv
region_to_gene_adjacencies: "region_to_gene_adj.tsv"
output eGRN .tsv
eRegulons_direct: "eRegulon_direct.tsv" eRegulons_extended: "eRegulons_extended.tsv"
output AUCell .h5mu
AUCell_direct: "AUCell_direct.h5mu" AUCell_extended: "AUCell_extended.h5mu"
output scplus mudata .h5mu
scplus_mdata: "scplusmdata.h5mu"
params_general: temp_dir: "scenicPlus/scratch" n_cpu: 20 seed: 666
params_data_preparation:
Params for prepare_GEX_ACC
bc_transform_func: ""lambda x: f'{x}'"" is_multiome: False key_to_group_by: "celltype" nr_cells_per_metacells: 10
Params for prepare_menr
direct_annotation: "Direct_annot" extended_annotation: "Orthology_annot"
Params for download_genome_annotations
species: "mmusculus" biomart_host: "http://www.ensembl.org"
Params for search_space
search_space_upstream: "1000 150000" search_space_downstream: "1000 150000" search_space_extend_tss: "10 10"
params_motif_enrichment: species: "mus_musculus" annotation_version: "v10nr_clust" motif_similarity_fdr: 0.001 orthologous_identity_threshold: 0.0 annotations_to_use: "Direct_annot Orthology_annot" fraction_overlap_w_dem_database: 0.4 dem_max_bg_regions: 500 dem_balance_number_of_promoters: True dem_promoter_space: 1_000 dem_adj_pval_thr: 0.05 dem_log2fc_thr: 1.0 dem_mean_fg_thr: 0.0 dem_motif_hit_thr: 3.0 fraction_overlap_w_ctx_database: 0.4 ctx_auc_threshold: 0.005 ctx_nes_threshold: 3.0 ctx_rank_threshold: 0.05
params_inference:
Params for tf_to_gene
tf_to_gene_importance_method: "GBM"
Params regions_to_gene
region_to_gene_importance_method: "GBM" region_to_gene_correlation_method: "SR"
Params for eGRN inference
order_regions_to_genes_by: "importance" order_TFs_to_genes_by: "importance" gsea_n_perm: 1000 quantile_thresholds_region_to_gene: "0.85 0.90 0.95" top_n_regionTogenes_per_gene: "5 10 15" top_n_regionTogenes_per_region: "" min_regions_per_gene: 0 rho_threshold: 0.05 min_target_genes: 10
Version (please complete the following information): I am using scenicplus 1.0a1 from the main branch
Hi @orian116
This is an issue with your genome annotation file (this should contain a column named "Gene").
Can you show the head of genome_annotation.tsv
All the best,
Seppe
Hi @SeppeDeWinter
Here is the head and tail of my genome annotation file. It does have the 'Gene' column:
I have this exact same issue and also do have the "Gene" column in my gene_annotation.tsv
@SeppeDeWinter I have narrowed down the error a bit. It seems like in the gene_search_space.py function, the column names are present before running pr_gene_annotation = pr.PyRanges(gene_annotation.query("Gene in @scplus_genes").copy()) but not after. I will try converting to a pyranges object without this. What file is "@scplus_genes" referencing? Here's a snippet of the code.
I tried defining scplus_genes before the line of code you highlighted. So far it appears to have solved the issue.
Thank you! That seemed to work but now I am getting a new error
Yea same here. No objects to concatenate... Stuck on the same thing.
Thank you! That seemed to work but now I am getting a new error
Okay, concatenate error solved for my part. I am just going to assume you processed your RNA-seq data in Seurat. I did this to fix the issue. It was damn near impossible to find the source of the error, but in the end it turned out to be an issue with how seurat exports h5ad files. For some reason it doesn't save the gene names in the right place, which means that the ACC_GEX.h5mu file you generate won't have the gene names stored correctly either.
import scanpy as sc
adata = sc.read("rna.h5ad") new_adata = adata.raw.to_adata() new_adata.var_names = adata.raw.var['_index'] #Or whatever your index is called new_adata.var_names.name = None new_adata.raw = new_adata sc.pp.normalize_total(new_adata, target_sum=1e4) sc.pp.log1p(new_adata) new_adata.write("rna_new.h5ad")
And then just use the new file in your config script.
Hope this works for you too. If so, then your next error will probably be this:
If you do get this error, updating dask to 2024.4.1 seems to fix it even though it warns you that it is not compatible with scenicplus
Also side note. If my assumption that you worked with seurat is correct, you may also have to go back into R and make sure to delete scale.data, because:
Add assay data
assay.group <- source[['assays']][[assay]] if (source$index()[[assay]]$slots[['scale.data']]) { x.data <- 'scale.data' raw.data <- 'data' } else { x.data <- 'data' raw.data <- if (source$index()[[assay]]$slots[['counts']]) { 'counts' } else { NULL } }
The data being called "raw.data" is not actually raw (Snippet from seurat-disk convert.R
Thank you! That seemed to work but now I am getting a new error
Okay, concatenate error solved for my part. I am just going to assume you processed your RNA-seq data in Seurat. I did this to fix the issue. It was damn near impossible to find the source of the error, but in the end it turned out to be an issue with how seurat exports h5ad files. For some reason it doesn't save the gene names in the right place, which means that the ACC_GEX.h5mu file you generate won't have the gene names stored correctly either.
import scanpy as sc
adata = sc.read("rna.h5ad") new_adata = adata.raw.to_adata() new_adata.var_names = adata.raw.var['_index'] #Or whatever your index is called new_adata.var_names.name = None new_adata.raw = new_adata sc.pp.normalize_total(new_adata, target_sum=1e4) sc.pp.log1p(new_adata) new_adata.write("rna_new.h5ad")
And then just use the new file in your config script.
Hope this works for you too. If so, then your next error will probably be this:
If you do get this error, updating dask to 2024.4.1 seems to fix it even though it warns you that it is not compatible with scenicplus
Also side note. If my assumption that you worked with seurat is correct, you may also have to go back into R and make sure to delete scale.data, because:
Add assay data
assay.group <- source[['assays']][[assay]] if (source$index()[[assay]]$slots[['scale.data']]) { x.data <- 'scale.data' raw.data <- 'data' } else { x.data <- 'data' raw.data <- if (source$index()[[assay]]$slots[['counts']]) { 'counts' } else { NULL } }
The data being called "raw.data" is not actually raw (Snippet from seurat-disk convert.R
How did you get to upgrade dask without upgrading pandas? When I run scenicplus after following your steps, I get an error for is_categorical being deprecated
Hmmm that sounds strange
I literally just did this: pip install dask==2024.4.1
And ran it again and then it worked
Thank you so much! I was able to get to the eRegulon dataframes

