Saving/reading .h5ad file results in altered adata.uns['log1p']
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of scanpy.
- [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.
Hello. I noticed an issue when trying to run tl.rank_genes_groups, which was the result of adata.uns['log1p'] being blank (i.e., an empty dictionary, {}). I have gone through my workflow and confirmed that adata.uns['log1p'] is the expected {'base': None} after each preprocessing step, so I don't think the issue is with any of preprocessing code. However, when I save my adata object to a .h5ad file using the .write() function and then read my .h5ad file using the sc.read() function, when I check adata.uns['log1p'] it is an empty dictionary - so maybe the issue is either in the writing or reading function? I am able to manually set adata.uns['log1p'] to {'base': None} after reading the file, and can then run downstream functions like tl.rank_genes_groups without issue. I have not had this problem previously when reading .h5ad files (into either the same Jupyter notebook or into a new Jupyter notebook). Since I can manually set adata.uns['log1p'] to {'base': None}, I don't think this issue is pressing. It's just a little strange to me. Thank you for any help/advice!
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Minimal code sample (that we can copy&paste without having any data)
This can all be run in one Jupyter notebook and should produce the issue (unless it's something exclusively on my end; I've been able to reproduce the error with my own data and one of the scanpy built-in test datasets). Sorry the code chunks are broken up/a little long; I am using the scran normalization approach outlined in the single cell tutorial.
adata = sc.datasets.pbmc3k()
sc.pp.filter_genes(adata, min_cells = 1)
# scran normalization
adata_pp = adata.copy()
sc.pp.normalize_per_cell(adata_pp, counts_per_cell_after = 1e6)
sc.pp.log1p(adata_pp)
sc.pp.pca(adata_pp, n_comps = 15)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp, key_added = 'groups', resolution = 0.5)
input_groups = adata_pp.obs['groups']
data_mat = adata.X.T
%%R -i data_mat -i input_groups -o size_factors
size_factors = sizeFactors(computeSumFactors(SingleCellExperiment(list(counts = data_mat)),
clusters = input_groups,
min.mean = 0.1))
del adata_pp
adata.obs['size_factors'] = size_factors
adata.layers['counts'] = adata.X.copy()
adata.X /= adata.obs['size_factors'].values[:,None]
sc.pp.log1p(adata)
adata.X = sp.sparse.csr_matrix(adata.X)
adata.raw = adata
sc.pp.highly_variable_genes(adata, flavor = 'cell_ranger', n_top_genes = 2000)
sc.pp.pca(adata, n_comps = 50, use_highly_variable = True, svd_solver = 'arpack')
sc.pp.neighbors(adata)
sc.tl.umap(adata)
adata.uns['log1p'] # this produces: {'base': None}
adata.write('test.h5ad')
adata = sc.read('test.h5ad')
adata.uns['log1p'] # the now produces: {}
sc.tl.leiden(adata, key_added='leiden_r1.0')
sc.tl.rank_genes_groups(adata, groupby='leiden_r1.0', key_added='rank_genes_all', method='wilcoxon') # this causes an error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [13], in <cell line: 2>()
1 # Calculate marker genes
----> 2 sc.tl.rank_genes_groups(adata, groupby='leiden_r1.0', key_added='rank_genes_all', method='wilcoxon', use_raw=False)
File /usr/local/lib/python3.8/dist-packages/scanpy/tools/_rank_genes_groups.py:590, in rank_genes_groups(adata, groupby, use_raw, groups, reference, n_genes, rankby_abs, pts, key_added, copy, method, corr_method, tie_correct, layer, **kwds)
580 adata.uns[key_added] = {}
581 adata.uns[key_added]['params'] = dict(
582 groupby=groupby,
583 reference=reference,
(...)
587 corr_method=corr_method,
588 )
--> 590 test_obj = _RankGenes(adata, groups_order, groupby, reference, use_raw, layer, pts)
592 if check_nonnegative_integers(test_obj.X) and method != 'logreg':
593 logg.warning(
594 "It seems you use rank_genes_groups on the raw count data. "
595 "Please logarithmize your data before calling rank_genes_groups."
596 )
File /usr/local/lib/python3.8/dist-packages/scanpy/tools/_rank_genes_groups.py:93, in _RankGenes.__init__(self, adata, groups, groupby, reference, use_raw, layer, comp_pts)
82 def __init__(
83 self,
84 adata,
(...)
90 comp_pts=False,
91 ):
---> 93 if 'log1p' in adata.uns_keys() and adata.uns['log1p']['base'] is not None:
94 self.expm1_func = lambda x: np.expm1(x * np.log(adata.uns['log1p']['base']))
95 else:
KeyError: 'base'
Versions
anndata 0.8.0rc1 scanpy 1.8.2 sinfo 0.3.4
PIL 9.0.1 anndata2ri 1.0.6 annoy NA asttokens NA backcall 0.2.0 backports NA beta_ufunc NA binom_ufunc NA cffi 1.15.0 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 debugpy 1.5.1 decorator 5.1.1 defusedxml 0.7.1 dunamai 1.11.0 entrypoints 0.4 executing 0.8.3 get_version 3.5.4 h5py 3.6.0 hypergeom_ufunc NA igraph 0.9.9 ipykernel 6.9.1 ipython_genutils 0.2.0 ipywidgets 7.6.5 jedi 0.18.1 jinja2 3.0.3 joblib 1.1.0 jupyter_server 1.13.5 kiwisolver 1.4.0 leidenalg 0.8.9 llvmlite 0.38.0 markupsafe 2.1.1 matplotlib 3.5.1 matplotlib_inline NA mpl_toolkits NA natsort 8.1.0 nbinom_ufunc NA numba 0.55.1 numexpr 2.8.1 numpy 1.21.0 packaging 21.3 pandas 1.4.1 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prompt_toolkit 3.0.28 ptyprocess 0.7.0 pure_eval 0.2.2 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.6.0 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.11.2 pynndescent 0.5.6 pyparsing 3.0.7 pytz 2021.3 pytz_deprecation_shim NA rpy2 3.4.2 scipy 1.8.0 scrublet NA seaborn 0.11.2 sitecustomize NA six 1.14.0 skimage 0.19.2 sklearn 1.0.2 stack_data 0.2.0 statsmodels 0.13.2 tables 3.7.0 texttable 1.6.4 threadpoolctl 3.1.0 tornado 6.1 tqdm 4.63.0 traitlets 5.1.1 typing_extensions NA tzlocal NA umap 0.5.2 wcwidth 0.2.5 zmq 22.3.0
IPython 8.1.1 jupyter_client 7.1.2 jupyter_core 4.9.2 jupyterlab 3.3.1 notebook 6.4.8
Python 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] Linux-5.10.76-linuxkit-x86_64-with-glibc2.29 8 logical CPU cores, x86_64
Session information updated at 2022-03-17 17:22
Have the same, changing anndata to 0.7.8 version helps;
Could someone please tell me how I can save a h5ad file from a Jupyter Notebook that I've made some modifications with another name and in the same path as before?
Hi, I found the answer to this question in another Q&A, please refer to (https://bytemeta.vip/repo/scverse/scanpy/issues/2239)
Got this on scanpy 1.9.1, did what #2239 suggested, as a temporary workaround
adata.uns['log1p']["base"] = None
It looks like the adata.uns['log1p']["base"] = None is lost at the next reading.
Using an h5ad data set saved in result_file, the following happens:
adata = sc.read_h5ad(results_file)
adata.uns['log1p']['base'] is None
#---------------------------------------------------------------------------
#KeyError Traceback (most recent call last)
#/home/sebastien/workshop_scRNA-Seq_UCLA/scanpy_tut3.py in line 2
# [153](file:///home/sebastien/workshop_scRNA-Seq_UCLA/scanpy_tut3.py?line=152) adata = sc.read(results_file)
#----> [154](file:///home/sebastien/workshop_scRNA-Seq_UCLA/scanpy_tut3.py?line=153) adata.uns['log1p']['base'] is None
#
#KeyError: 'base'
adata.uns['log1p']["base"] = None # see https://github.com/scverse/scanpy/issues/2239
adata.uns['log1p']['base'] is None
# now returns True
adata.write(results_file)
adata.uns['log1p']['base'] is None
# still returns True
adata = sc.read_h5ad(results_file) # same with adata = sc.read(results_file)
adata.uns['log1p']['base'] is None
#
#KeyError: 'base'
This was with sc.__version__ 1.9.1
Hello, for information: I encountered the same problem with scanpy=1.9.1 either with anndata=0.8.0 or 0.7.8 or 0.7.6.
I found it useful by calling
scanpy.pp.log1p(adata)
again before the function that returns the keyerror:base
For example, in the PBMC3K tutorial, calling this function again before step 43: Comparing to a single cluster.
sc.pp.log1p(adata)
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
I am new to scanpy so I do not know what exactly is going on here so all I can say is "for some reason" the code works this way
The log1p→base thing was fixed in #2546