scanpy Saving/reading .h5ad file results in altered adata.uns['log1p']

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of scanpy.
[ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

Hello. I noticed an issue when trying to run tl.rank_genes_groups, which was the result of adata.uns['log1p'] being blank (i.e., an empty dictionary, {}). I have gone through my workflow and confirmed that adata.uns['log1p'] is the expected {'base': None} after each preprocessing step, so I don't think the issue is with any of preprocessing code. However, when I save my adata object to a .h5ad file using the .write() function and then read my .h5ad file using the sc.read() function, when I check adata.uns['log1p'] it is an empty dictionary - so maybe the issue is either in the writing or reading function? I am able to manually set adata.uns['log1p'] to {'base': None} after reading the file, and can then run downstream functions like tl.rank_genes_groups without issue. I have not had this problem previously when reading .h5ad files (into either the same Jupyter notebook or into a new Jupyter notebook). Since I can manually set adata.uns['log1p'] to {'base': None}, I don't think this issue is pressing. It's just a little strange to me. Thank you for any help/advice!

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Minimal code sample (that we can copy&paste without having any data)

This can all be run in one Jupyter notebook and should produce the issue (unless it's something exclusively on my end; I've been able to reproduce the error with my own data and one of the scanpy built-in test datasets). Sorry the code chunks are broken up/a little long; I am using the scran normalization approach outlined in the single cell tutorial.

adata = sc.datasets.pbmc3k()
sc.pp.filter_genes(adata, min_cells = 1)

# scran normalization
adata_pp = adata.copy()
sc.pp.normalize_per_cell(adata_pp, counts_per_cell_after = 1e6)
sc.pp.log1p(adata_pp)
sc.pp.pca(adata_pp, n_comps = 15)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp, key_added = 'groups', resolution = 0.5)
input_groups = adata_pp.obs['groups']
data_mat = adata.X.T

%%R -i data_mat -i input_groups -o size_factors
size_factors = sizeFactors(computeSumFactors(SingleCellExperiment(list(counts = data_mat)), 
                                             clusters = input_groups, 
                                             min.mean = 0.1))

del adata_pp
adata.obs['size_factors'] = size_factors
adata.layers['counts'] = adata.X.copy()
adata.X /= adata.obs['size_factors'].values[:,None]
sc.pp.log1p(adata)
adata.X = sp.sparse.csr_matrix(adata.X)
adata.raw = adata

sc.pp.highly_variable_genes(adata, flavor = 'cell_ranger', n_top_genes = 2000)
sc.pp.pca(adata, n_comps = 50, use_highly_variable = True, svd_solver = 'arpack')
sc.pp.neighbors(adata)
sc.tl.umap(adata)
adata.uns['log1p'] # this produces: {'base': None}
adata.write('test.h5ad')

adata = sc.read('test.h5ad')
adata.uns['log1p'] # the now produces: {}
sc.tl.leiden(adata, key_added='leiden_r1.0')
sc.tl.rank_genes_groups(adata, groupby='leiden_r1.0', key_added='rank_genes_all', method='wilcoxon') # this causes an error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [13], in <cell line: 2>()
      1 # Calculate marker genes
----> 2 sc.tl.rank_genes_groups(adata, groupby='leiden_r1.0', key_added='rank_genes_all', method='wilcoxon', use_raw=False)

File /usr/local/lib/python3.8/dist-packages/scanpy/tools/_rank_genes_groups.py:590, in rank_genes_groups(adata, groupby, use_raw, groups, reference, n_genes, rankby_abs, pts, key_added, copy, method, corr_method, tie_correct, layer, **kwds)
    580 adata.uns[key_added] = {}
    581 adata.uns[key_added]['params'] = dict(
    582     groupby=groupby,
    583     reference=reference,
   (...)
    587     corr_method=corr_method,
    588 )
--> 590 test_obj = _RankGenes(adata, groups_order, groupby, reference, use_raw, layer, pts)
    592 if check_nonnegative_integers(test_obj.X) and method != 'logreg':
    593     logg.warning(
    594         "It seems you use rank_genes_groups on the raw count data. "
    595         "Please logarithmize your data before calling rank_genes_groups."
    596     )

File /usr/local/lib/python3.8/dist-packages/scanpy/tools/_rank_genes_groups.py:93, in _RankGenes.__init__(self, adata, groups, groupby, reference, use_raw, layer, comp_pts)
     82 def __init__(
     83     self,
     84     adata,
   (...)
     90     comp_pts=False,
     91 ):
---> 93     if 'log1p' in adata.uns_keys() and adata.uns['log1p']['base'] is not None:
     94         self.expm1_func = lambda x: np.expm1(x * np.log(adata.uns['log1p']['base']))
     95     else:

KeyError: 'base'

Versions

anndata 0.8.0rc1 scanpy 1.8.2 sinfo 0.3.4

PIL 9.0.1 anndata2ri 1.0.6 annoy NA asttokens NA backcall 0.2.0 backports NA beta_ufunc NA binom_ufunc NA cffi 1.15.0 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 debugpy 1.5.1 decorator 5.1.1 defusedxml 0.7.1 dunamai 1.11.0 entrypoints 0.4 executing 0.8.3 get_version 3.5.4 h5py 3.6.0 hypergeom_ufunc NA igraph 0.9.9 ipykernel 6.9.1 ipython_genutils 0.2.0 ipywidgets 7.6.5 jedi 0.18.1 jinja2 3.0.3 joblib 1.1.0 jupyter_server 1.13.5 kiwisolver 1.4.0 leidenalg 0.8.9 llvmlite 0.38.0 markupsafe 2.1.1 matplotlib 3.5.1 matplotlib_inline NA mpl_toolkits NA natsort 8.1.0 nbinom_ufunc NA numba 0.55.1 numexpr 2.8.1 numpy 1.21.0 packaging 21.3 pandas 1.4.1 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prompt_toolkit 3.0.28 ptyprocess 0.7.0 pure_eval 0.2.2 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.6.0 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.11.2 pynndescent 0.5.6 pyparsing 3.0.7 pytz 2021.3 pytz_deprecation_shim NA rpy2 3.4.2 scipy 1.8.0 scrublet NA seaborn 0.11.2 sitecustomize NA six 1.14.0 skimage 0.19.2 sklearn 1.0.2 stack_data 0.2.0 statsmodels 0.13.2 tables 3.7.0 texttable 1.6.4 threadpoolctl 3.1.0 tornado 6.1 tqdm 4.63.0 traitlets 5.1.1 typing_extensions NA tzlocal NA umap 0.5.2 wcwidth 0.2.5 zmq 22.3.0

IPython 8.1.1 jupyter_client 7.1.2 jupyter_core 4.9.2 jupyterlab 3.3.1 notebook 6.4.8

Python 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] Linux-5.10.76-linuxkit-x86_64-with-glibc2.29 8 logical CPU cores, x86_64

Session information updated at 2022-03-17 17:22

Mar 17 '22 18:03 oligomyeggo

Have the same, changing anndata to 0.7.8 version helps;

Mar 25 '22 03:03 ghost

Could someone please tell me how I can save a h5ad file from a Jupyter Notebook that I've made some modifications with another name and in the same path as before?

May 26 '22 19:05 bclopesrs

Hi, I found the answer to this question in another Q&A, please refer to (https://bytemeta.vip/repo/scverse/scanpy/issues/2239)

Jun 25 '22 14:06 Gbighe

Got this on scanpy 1.9.1, did what #2239 suggested, as a temporary workaround

adata.uns['log1p']["base"] = None

Oct 12 '22 13:10 maximilianh

It looks like the adata.uns['log1p']["base"] = None is lost at the next reading.

Using an h5ad data set saved in result_file, the following happens:

adata = sc.read_h5ad(results_file)
adata.uns['log1p']['base'] is None
#---------------------------------------------------------------------------
#KeyError                                  Traceback (most recent call last)
#/home/sebastien/workshop_scRNA-Seq_UCLA/scanpy_tut3.py in line 2
#      [153](file:///home/sebastien/workshop_scRNA-Seq_UCLA/scanpy_tut3.py?line=152) adata = sc.read(results_file)
#----> [154](file:///home/sebastien/workshop_scRNA-Seq_UCLA/scanpy_tut3.py?line=153) adata.uns['log1p']['base'] is None

#
#KeyError: 'base'

adata.uns['log1p']["base"] = None # see https://github.com/scverse/scanpy/issues/2239
adata.uns['log1p']['base'] is None
# now returns True

adata.write(results_file)
adata.uns['log1p']['base'] is None
# still returns True

adata = sc.read_h5ad(results_file) # same with adata = sc.read(results_file)
adata.uns['log1p']['base'] is None
#
#KeyError: 'base'

This was with sc.__version__ 1.9.1

Dec 16 '22 15:12 sbwiecko

Hello, for information: I encountered the same problem with scanpy=1.9.1 either with anndata=0.8.0 or 0.7.8 or 0.7.6.

Jan 24 '23 11:01 hl-xue

I found it useful by calling scanpy.pp.log1p(adata) again before the function that returns the keyerror:base

For example, in the PBMC3K tutorial, calling this function again before step 43: Comparing to a single cluster. sc.pp.log1p(adata) sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon') sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)

I am new to scanpy so I do not know what exactly is going on here so all I can say is "for some reason" the code works this way

Jul 14 '23 05:07 Phil-Momoka

The log1p→base thing was fixed in #2546

Jul 18 '23 11:07 flying-sheep