
`read_10x_mtx()` cannot handle numerical barcodes properly (BD Rhapsody specifically)

Open pormr opened this issue 9 months ago • 1 comments

Please make sure these conditions are met

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of scanpy.
  • [ ] (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

Hello, I am trying to load BD Rhapsody data using `read_10x_mtx()`, but the function does not seem to handle numerical barcodes correctly. I checked the `barcodes.tsv.gz` file, and it contains integers instead of the expected ACGT sequences, which causes the load into an AnnData object to fail. According to the BD Rhapsody documentation, BD uses numerical cell IDs to distinguish cells, unlike the standard 10X Genomics format, which uses string barcodes. Example data can be downloaded from the BD Rhapsody website here; I have attached a small example of their GEX matrix file: BD-Demo-WTA-SMK_SampleTag03_hs_RSEC_MolsPerCell_MEX.zip.
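For anyone who wants to confirm the same thing on their own data, here is a minimal sketch that reads a gzipped barcodes file as plain strings and checks whether every entry is purely numeric. The helper name `read_barcodes_as_str` and the synthetic demo file are illustrative assumptions, not part of scanpy; with real data you would point it at something like `data/SMK_SampleTag03/barcodes.tsv.gz`.

```python
import gzip
import tempfile
from pathlib import Path


def read_barcodes_as_str(path):
    """Return the first tab-separated field of each non-empty line, as str.

    Hypothetical helper for inspection only; it never converts to int,
    so numeric cell IDs stay strings.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.strip().split("\t")[0] for line in handle if line.strip()]


# Self-contained demo on a synthetic file that mimics the BD Rhapsody layout
# (one integer cell ID per line). Replace demo_path with a real file to inspect it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "barcodes.tsv.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("9265\n11954\n21560\n")
    barcodes = read_barcodes_as_str(demo_path)

# True when every barcode consists only of digits, as in the attached example
all_numeric = all(code.isdigit() for code in barcodes)
print(barcodes[:5], all_numeric)
```

If `all_numeric` is true, a reader that infers column types will likely turn these IDs into integers unless it is told to keep them as strings.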

Minimal code sample

>>> import scanpy as sc
>>> adata = sc.read_10x_mtx('data/SMK_SampleTag03')

Error output

<CONDA_PREFIX>/lib/python3.12/site-packages/anndata/_core/anndata.py:812: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
    [9265, 11954, 21560, 31507, 32668]

    Inferred to be: integer

  names = self._prep_dim_index(names, "obs")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<CONDA_PREFIX>/lib/python3.12/site-packages/legacy_api_wrap/__init__.py", line 82, in fn_compatible
    return fn(*args_all, **kw)
           ^^^^^^^^^^^^^^^^^^^
  File "<CONDA_PREFIX>/lib/python3.12/site-packages/scanpy/readwrite.py", line 597, in read_10x_mtx
    return adata[:, gex_rows].copy()
           ~~~~~^^^^^^^^^^^^^
  File "<CONDA_PREFIX>/lib/python3.12/site-packages/anndata/_core/anndata.py", line 1011, in __getitem__
    oidx, vidx = self._normalize_indices(index)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<CONDA_PREFIX>/lib/python3.12/site-packages/anndata/_core/anndata.py", line 992, in _normalize_indices
    return _normalize_indices(index, self.obs_names, self.var_names)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<CONDA_PREFIX>/lib/python3.12/site-packages/anndata/_core/index.py", line 32, in _normalize_indices
    ax0 = _normalize_index(ax0, names0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<CONDA_PREFIX>/lib/python3.12/site-packages/anndata/_core/index.py", line 50, in _normalize_index
    assert index.dtype != int, msg
           ^^^^^^^^^^^^^^^^^^
AssertionError: Don’t call _normalize_index with non-categorical/string names

Versions

scanpy  1.11.1
----    ----
h5py    3.13.0
setuptools      80.1.0
colorama        0.4.6
session-info2   0.1.2
python-dateutil 2.9.0.post0
packaging       25.0
scikit-learn    1.5.2
numpy   2.2.5
typing_extensions       4.13.2
legacy-api-wrap 1.4.1
numba   0.61.2
llvmlite        0.44.0
six     1.17.0
matplotlib      3.10.1
joblib  1.5.0
pyparsing       3.2.3
cycler  0.12.1
pandas  2.2.3
pytz    2025.2
scipy   1.15.2
natsort 8.4.0
threadpoolctl   3.6.0
kiwisolver      1.4.8
anndata 0.11.4
pillow  11.1.0
----    ----
Python  3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0]
OS      Linux-4.18.0-348.el8.x86_64-x86_64-with-glibc2.28
CPU     64 logical CPU cores, x86_64
GPU     No GPU found
Updated <SCRUBBED>

pormr avatar May 07 '25 13:05 pormr

I discovered that you can get around this by setting gex_only to False when calling read_10x_mtx():

adata = sc.read_10x_mtx(path, gex_only=False)

However, the warning message persists:

<SCRUBBED>/lib/python3.12/site-packages/anndata/_core/anndata.py:812: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
    [2534, 5269, 5661, 8881, 9730]

    Inferred to be: integer

  names = self._prep_dim_index(names, "obs")

I haven't tested how other functions cope with numerical barcodes, but I suspect the issue may be more widespread. I would appreciate any help or suggestions on how to handle this situation.
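One possible way to silence the warning after loading is to convert the integer cell labels to strings. The sketch below assumes that assigning a list of strings to `adata.obs_names` is accepted by AnnData. `stringify_obs_names` is a hypothetical helper name, and the `SimpleNamespace` object is only a stand-in so the snippet runs without scanpy installed.

```python
from types import SimpleNamespace


def stringify_obs_names(adata):
    """Replace each cell label with its string form, in place.

    Hypothetical helper: works on any object exposing an iterable,
    assignable `obs_names` attribute (an AnnData object is the intended use).
    """
    adata.obs_names = [str(name) for name in adata.obs_names]
    return adata


# Stand-in for an AnnData object with integer cell IDs (assumption for the demo)
fake = SimpleNamespace(obs_names=[9265, 11954, 21560])
stringify_obs_names(fake)
print(fake.obs_names)
```

With real data this would be applied as `adata = stringify_obs_names(sc.read_10x_mtx(path, gex_only=False))`. If only gene expression features are wanted, the subsetting that `gex_only=True` attempts could then be done by hand, for example `adata[:, adata.var["feature_types"] == "Gene Expression"].copy()`, assuming the `feature_types` column is present for this dataset.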

pormr avatar May 07 '25 13:05 pormr