rapids_singlecell icon indicating copy to clipboard operation
rapids_singlecell copied to clipboard

[BUG] umap needs init_pos = 'random' for larger datasets

Open bkmartinjr opened this issue 1 year ago • 11 comments

Describe the bug

rapids_singlecell.tl.umap will occasionally generate embeddings with spuriously large values (e.g., a small number of points that are very large).

Details: I start with a slice of the Tabula Muris data, and follow the standard workflow for generating a UMAP embedding: filter, highly_variable_genes, pca, neighbors, umap. When I use the ScanPy CPU implementation of UMAP, things look as expected. However, if I use the RSC tl.umap, I get embeddings which have a small number of points with very large values (i.e., appear as small, very dispersed clusters).

This occurs regardless of which implementation I use for the other steps (e.g., PCA) - only the call to umap appears to vary in its behavior.

Example plot using scanpy.tl.umap: image

Example plot using rapids_singlecell.tl.umap: image

Running numpy.histogram on the embedding shows the effect. Scanpy generated umap, points are well distributed:

# bin counts
array([  495,  1192,  2037,  6329,  7213,  7952, 11369, 12860,  8011,
        7390,  6820,  7103,  7795,  6455,  5127,  3900,  2394,  1746,
         347,    15])
# bin edges
array([-11.400737  ,  -9.587617  ,  -7.7744966 ,  -5.961376  ,
        -4.148256  ,  -2.3351357 ,  -0.52201545,   1.2911048 ,
         3.104225  ,   4.917345  ,   6.7304654 ,   8.543586  ,
        10.356706  ,  12.1698265 ,  13.982946  ,  15.796066  ,
        17.609186  ,  19.422306  ,  21.235428  ,  23.048548  ,
        24.861668  ], dtype=float32)

RSC generated umap, small number of outliers:

# bin counts
array([    2,     1,     0,     0,     0,     0,     5,     2,     0,
           1, 14762, 91773,     2,     0,     0,     0,     1,     0,
           0,     1])
# bin edges
array([-1622.5352 , -1475.78   , -1329.025  , -1182.2699 , -1035.5148 ,
        -888.75977,  -742.00464,  -595.2496 ,  -448.49448,  -301.7394 ,
        -154.98431,    -8.22923,   138.52585,   285.28094,   432.036  ,
         578.79114,   725.5462 ,   872.3013 ,  1019.05634,  1165.8114 ,
        1312.5665 ], dtype=float32)

I have found that this issue only occurs with specific data. For example, some slices of the above dataset generate good embeddings, others do not. For datasets where it fails, it fails reliably (so it appears to be something in the dataset or parameterization that triggers the bug).

When this occurs, the actual embedding in the central mass of points looks good. It is only the additional outliers which are problematic.

Steps/Code to reproduce bug

Installed on Linux (popos/ubuntu), using mamba and the rsc_rapids_24.04.yml recipe. I have a sample dataset that reliably fails and which I can make available.

Test code:

import rapids_singlecell as rsc
import scanpy as sc
import numpy as np

adata = sc.read_h5ad("fooz.h5ad")

def compute_embedding(adata: sc.AnnData):
    # config
    n_neighbors = 15
    n_top_genes = 5000
    min_genes = 10
    min_cells = 10

    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)

    if adata.n_obs == 0 or adata.n_vars == 0:
        raise ValueError(f"Empty query result")

    if adata.n_obs <= n_neighbors:
        raise ValueError(f"Insufficient cells to compute KNN")

    rsc.get.anndata_to_GPU(adata)
    rsc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    rsc.tl.pca(adata)
    rsc.pp.neighbors(adata, n_neighbors=n_neighbors)
    rsc.get.anndata_to_CPU(adata)

    adata_CPU = adata.copy()
    sc.tl.umap(adata_CPU)

    adata_GPU = adata.copy()
    rsc.get.anndata_to_GPU(adata_GPU)
    rsc.tl.umap(adata_GPU)
    rsc.get.anndata_to_CPU(adata_GPU)

    return adata_CPU, adata_GPU


def summarize_umap(adata, save):
    print(repr(adata))
    print(f"X_umap, min={adata.obsm['X_umap'].min()}, max={adata.obsm['X_umap'].max()}")
    hist, bins = np.histogram(adata.obsm["X_umap"].flatten(), bins=20)
    print(repr(hist))
    print(repr(bins))
    sc.pl.umap(adata, color="cell_ontology_class", save=save, show=False)
    print()


adata_CPU, adata_GPU = compute_embedding(adata.copy())


print("UMAP via RAPIDS")
summarize_umap(adata_GPU, "gpu.png")

print("UMAP via CPU")
summarize_umap(adata_CPU, "cpu.png")

Environment details (please complete the following information):

  • Environment location: bare metal Lenovo/intel laptop
  • Linux Distro/Architecture: PopOS latest (aka Ubuntu 22.04 amd64)
  • GPU Model/Driver: GTX 3050 (GA 107M)
  • CUDA: 11.4
  • Method of Rapids install: mamba, using the rsc_rapids_24.04.yml from GH
pip list
Package                 Version
----------------------- ----------------
aiobotocore             2.12.3
aiohttp                 3.9.5
aioitertools            0.11.0
aiosignal               1.3.1
anndata                 0.10.7
array_api_compat        1.6
asttokens               2.4.1
async-timeout           4.0.3
attrs                   23.2.0
bcc                     0.18.0
blinker                 1.4
botocore                1.34.69
Brlapi                  0.8.3
cellxgene-census        1.13.0
certifi                 2020.6.20
chardet                 4.0.0
chrome-gnome-shell      0.0.0
click                   8.0.3
cloudpickle             2.2.1
colorama                0.4.4
comm                    0.2.2
command-not-found       0.3
contourpy               1.2.1
cryptography            3.4.8
cupshelpers             1.0
cycler                  0.12.1
dbus-python             1.2.18
decorator               5.1.1
defer                   1.0.6
distro                  1.7.0
exceptiongroup          1.2.1
executing               2.0.1
fonttools               4.51.0
frozenlist              1.4.1
fsspec                  2024.3.1
h5py                    3.11.0
hidpidaemon             18.4.6
httplib2                0.20.2
idna                    3.3
importlib-metadata      4.6.4
ipython                 8.24.0
ipywidgets              8.1.2
jedi                    0.19.1
jeepney                 0.7.1
jmespath                1.0.1
joblib                  1.4.2
jupyterlab_widgets      3.0.10
kernelstub              3.1.4
keyring                 23.5.0
kiwisolver              1.4.5
language-selector       0.1
launchpadlib            1.10.16
lazr.restfulclient      0.14.4
lazr.uri                1.0.6
legacy-api-wrap         1.4
llvmlite                0.42.0
louis                   3.20.0
macaroonbakery          1.3.1
matplotlib              3.8.4
matplotlib-inline       0.1.7
more-itertools          8.10.0
multidict               6.0.5
natsort                 8.4.0
netaddr                 0.8.0
netifaces               0.11.0
networkx                3.3
numba                   0.59.1
numpy                   1.26.4
oauthlib                3.2.0
packaging               24.0
pandas                  2.2.2
parso                   0.8.4
patsy                   0.5.6
pexpect                 4.9.0
pillow                  10.3.0
pip                     22.0.2
pop-transition          1.1.2
prompt-toolkit          3.0.43
protobuf                3.12.4
ptyprocess              0.7.0
pure-eval               0.2.2
pyarrow                 16.0.0
pyarrow-hotfix          0.6
pycairo                 1.20.1
pycups                  2.0.1
pydbus                  0.6.0
Pygments                2.18.0
PyGObject               3.42.1
PyJWT                   2.3.0
pymacaroons             0.13.0
PyNaCl                  1.5.0
pynndescent             0.5.12
pyparsing               2.4.7
pyRFC3339               1.1
python-apt              2.4.0+ubuntu3
python-dateutil         2.9.0.post0
python-debian           0.1.43+ubuntu1.1
python-gnupg            0.4.8
python-xlib             0.29
pytz                    2022.1
pyxdg                   0.27
PyYAML                  5.4.1
repolib                 2.2.1
repoman                 1.4.0
requests                2.25.1
s3fs                    2024.3.1
scanpy                  1.10.1
scikit-learn            1.4.2
scipy                   1.13.0
screen-resolution-extra 0.0.0
seaborn                 0.13.2
SecretStorage           3.3.1
session-info            1.0.0
sessioninstaller        0.0.0
setuptools              59.6.0
six                     1.16.0
somacore                1.0.10
stack-data              0.6.3
statsmodels             0.14.2
stdlib-list             0.10.0
systemd-python          234
tblib                   1.7.0
threadpoolctl           3.5.0
tiledb                  0.27.1
tiledb-cloud            0.12.7
tiledbsoma              1.9.5
tqdm                    4.66.4
traitlets               5.14.3
typing_extensions       4.11.0
tzdata                  2024.1
ubuntu-drivers-common   0.0.0
ufw                     0.36.1
umap-learn              0.5.6
urllib3                 1.26.5
wadllib                 1.3.6
wcwidth                 0.2.13
wheel                   0.37.1
widgetsnbextension      4.0.10
wrapt                   1.16.0
xdg                     5
xkit                    0.0.0
yarl                    1.9.4
zipp                    1.0.0

bkmartinjr avatar May 16 '24 19:05 bkmartinjr

Please try setting 'init_pos="random"'. This usually works for me.

Intron7 avatar May 16 '24 19:05 Intron7

Yep, that seems to resolve it. Known issue with spectral init?

bkmartinjr avatar May 16 '24 19:05 bkmartinjr

I don't know actually. I have ran into this too for very large datasets (1M or more). I will forward this to the rapids team at NVIDIA.

Intron7 avatar May 16 '24 19:05 Intron7

I updated the title so people know how to fix it. I'll leave this open for know. I'll also update UMAP so that I has a new default with auto that sets the init_pos to random if you have more than 1M cells.

Intron7 avatar May 16 '24 20:05 Intron7

The issue remains when set init_pos = 'random'

SNOL2 avatar Oct 09 '24 06:10 SNOL2

There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.

Intron7 avatar Oct 09 '24 07:10 Intron7

There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.

Thanks for your quick reply!

SNOL2 avatar Oct 10 '24 06:10 SNOL2

There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.

Hi, have this be updated?

SNOL2 avatar Dec 02 '24 04:12 SNOL2

@Intron7 https://github.com/rapidsai/cuml/issues/6045. Unless this issue is fixed, we should remove init_pos=spectral for all UMAP computations.

abs51295 avatar Jan 29 '25 02:01 abs51295

For reference, the RAPIDS team is actively working to fix the issue with spectral initialization. The first of two PRs required to fix this has been posted to the RAFT repo here. I cannot make any promises about timeline, but I will be working on the corresponding cuML PR this week.

wphicks avatar Feb 03 '25 22:02 wphicks

With https://github.com/rapidsai/raft/pull/2568 merged, the cuML 25.04 nightlies should now have a fix for the underlying spectral initialization issue.

For most use cases, this will be enough to eliminate the "flyaway points" that were mentioned in the initial bug report. In some cases, though, you will still need to increase n_epochs somewhat relative to the value you would select when using UMAP-Learn in order to completely eliminate those flyaways. If n_epochs is high enough that you do not see this issue with random initialization, though, it should definitely be high enough for spectral initialization. Please let us know if you continue to see flyaways with spectral initialization anywhere that you don't see them with random initialization.

wphicks avatar Feb 19 '25 17:02 wphicks