[BUG] umap needs init_pos = 'random' for larger datasets
Describe the bug
rapids_singlecell.tl.umap will occasionally generate embeddings with spuriously large values (e.g., a small number of points that are very large).
Details: I start with a slice of the Tabula Muris data, and follow the standard workflow for generating a UMAP embedding: filter, highly_variable_genes, pca, neighbors, umap. When I use the ScanPy CPU implementation of UMAP, things look as expected. However, if I use the RSC tl.umap, I get embeddings which have a small number of points with very large values (i.e., appear as small, very dispersed clusters).
This occurs regardless of which implementation I use for the other steps (e.g., PCA) - only the call to umap appears to vary in its behavior.
Example plot using scanpy.tl.umap (image attached in original issue):
Example plot using rapids_singlecell.tl.umap (image attached in original issue):
Running numpy.histogram on the embedding values shows the effect. Scanpy-generated UMAP, with points well distributed:
# bin counts
array([ 495, 1192, 2037, 6329, 7213, 7952, 11369, 12860, 8011,
7390, 6820, 7103, 7795, 6455, 5127, 3900, 2394, 1746,
347, 15])
# bin edges
array([-11.400737 , -9.587617 , -7.7744966 , -5.961376 ,
-4.148256 , -2.3351357 , -0.52201545, 1.2911048 ,
3.104225 , 4.917345 , 6.7304654 , 8.543586 ,
10.356706 , 12.1698265 , 13.982946 , 15.796066 ,
17.609186 , 19.422306 , 21.235428 , 23.048548 ,
24.861668 ], dtype=float32)
RSC-generated UMAP, with a small number of extreme outliers:
# bin counts
array([ 2, 1, 0, 0, 0, 0, 5, 2, 0,
1, 14762, 91773, 2, 0, 0, 0, 1, 0,
0, 1])
# bin edges
array([-1622.5352 , -1475.78 , -1329.025 , -1182.2699 , -1035.5148 ,
-888.75977, -742.00464, -595.2496 , -448.49448, -301.7394 ,
-154.98431, -8.22923, 138.52585, 285.28094, 432.036 ,
578.79114, 725.5462 , 872.3013 , 1019.05634, 1165.8114 ,
1312.5665 ], dtype=float32)
I have found that this issue only occurs with specific data. For example, some slices of the above dataset generate good embeddings, others do not. For datasets where it fails, it fails reliably (so it appears to be something in the dataset or parameterization that triggers the bug).
When this occurs, the actual embedding in the central mass of points looks good. It is only the additional outliers which are problematic.
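To make the "central mass plus flyaways" pattern concrete, outliers like these can be flagged with a simple robust-distance check. This is a hypothetical helper (the function name and the MAD-based threshold are my own, not part of Scanpy or RSC):

```python
import numpy as np

def flag_flyaways(embedding: np.ndarray, k: float = 10.0) -> np.ndarray:
    """Flag points whose distance from the embedding's median position
    exceeds the median distance by k median absolute deviations.

    Hypothetical helper for inspecting UMAP output; not part of any library.
    """
    center = np.median(embedding, axis=0)
    dist = np.linalg.norm(embedding - center, axis=1)
    mad = np.median(np.abs(dist - np.median(dist)))
    return dist > np.median(dist) + k * max(mad, 1e-12)

# Synthetic illustration: a tight cluster plus a few extreme points,
# mimicking the two histograms above.
rng = np.random.default_rng(0)
emb = np.vstack([
    rng.normal(0, 1, size=(1000, 2)),      # central mass
    rng.normal(1500, 10, size=(5, 2)),     # injected flyaways
])
print(flag_flyaways(emb).sum())  # → 5
```

Because the median and MAD are insensitive to a handful of extreme values, the threshold stays anchored to the central mass even when flyaways are present.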
Steps/Code to reproduce bug
Installed on Linux (popos/ubuntu), using mamba and the rsc_rapids_24.04.yml recipe. I have a sample dataset that reliably fails and which I can make available.
Test code:
import rapids_singlecell as rsc
import scanpy as sc
import numpy as np

adata = sc.read_h5ad("fooz.h5ad")

def compute_embedding(adata: sc.AnnData):
    # config
    n_neighbors = 15
    n_top_genes = 5000
    min_genes = 10
    min_cells = 10

    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)
    if adata.n_obs == 0 or adata.n_vars == 0:
        raise ValueError("Empty query result")
    if adata.n_obs <= n_neighbors:
        raise ValueError("Insufficient cells to compute KNN")

    rsc.get.anndata_to_GPU(adata)
    rsc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    rsc.tl.pca(adata)
    rsc.pp.neighbors(adata, n_neighbors=n_neighbors)
    rsc.get.anndata_to_CPU(adata)

    adata_CPU = adata.copy()
    sc.tl.umap(adata_CPU)

    adata_GPU = adata.copy()
    rsc.get.anndata_to_GPU(adata_GPU)
    rsc.tl.umap(adata_GPU)
    rsc.get.anndata_to_CPU(adata_GPU)

    return adata_CPU, adata_GPU

def summarize_umap(adata, save):
    print(repr(adata))
    print(f"X_umap, min={adata.obsm['X_umap'].min()}, max={adata.obsm['X_umap'].max()}")
    hist, bins = np.histogram(adata.obsm["X_umap"].flatten(), bins=20)
    print(repr(hist))
    print(repr(bins))
    sc.pl.umap(adata, color="cell_ontology_class", save=save, show=False)
    print()

adata_CPU, adata_GPU = compute_embedding(adata.copy())
print("UMAP via RAPIDS")
summarize_umap(adata_GPU, "gpu.png")
print("UMAP via CPU")
summarize_umap(adata_CPU, "cpu.png")
Environment details (please complete the following information):
- Environment location: bare metal Lenovo/intel laptop
- Linux Distro/Architecture: PopOS latest (aka Ubuntu 22.04 amd64)
- GPU Model/Driver: RTX 3050 Laptop (GA107M)
- CUDA: 11.4
- Method of Rapids install: mamba, using the rsc_rapids_24.04.yml from GH
pip list
Package Version
----------------------- ----------------
aiobotocore 2.12.3
aiohttp 3.9.5
aioitertools 0.11.0
aiosignal 1.3.1
anndata 0.10.7
array_api_compat 1.6
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
bcc 0.18.0
blinker 1.4
botocore 1.34.69
Brlapi 0.8.3
cellxgene-census 1.13.0
certifi 2020.6.20
chardet 4.0.0
chrome-gnome-shell 0.0.0
click 8.0.3
cloudpickle 2.2.1
colorama 0.4.4
comm 0.2.2
command-not-found 0.3
contourpy 1.2.1
cryptography 3.4.8
cupshelpers 1.0
cycler 0.12.1
dbus-python 1.2.18
decorator 5.1.1
defer 1.0.6
distro 1.7.0
exceptiongroup 1.2.1
executing 2.0.1
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2024.3.1
h5py 3.11.0
hidpidaemon 18.4.6
httplib2 0.20.2
idna 3.3
importlib-metadata 4.6.4
ipython 8.24.0
ipywidgets 8.1.2
jedi 0.19.1
jeepney 0.7.1
jmespath 1.0.1
joblib 1.4.2
jupyterlab_widgets 3.0.10
kernelstub 3.1.4
keyring 23.5.0
kiwisolver 1.4.5
language-selector 0.1
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
legacy-api-wrap 1.4
llvmlite 0.42.0
louis 3.20.0
macaroonbakery 1.3.1
matplotlib 3.8.4
matplotlib-inline 0.1.7
more-itertools 8.10.0
multidict 6.0.5
natsort 8.4.0
netaddr 0.8.0
netifaces 0.11.0
networkx 3.3
numba 0.59.1
numpy 1.26.4
oauthlib 3.2.0
packaging 24.0
pandas 2.2.2
parso 0.8.4
patsy 0.5.6
pexpect 4.9.0
pillow 10.3.0
pip 22.0.2
pop-transition 1.1.2
prompt-toolkit 3.0.43
protobuf 3.12.4
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 16.0.0
pyarrow-hotfix 0.6
pycairo 1.20.1
pycups 2.0.1
pydbus 0.6.0
Pygments 2.18.0
PyGObject 3.42.1
PyJWT 2.3.0
pymacaroons 0.13.0
PyNaCl 1.5.0
pynndescent 0.5.12
pyparsing 2.4.7
pyRFC3339 1.1
python-apt 2.4.0+ubuntu3
python-dateutil 2.9.0.post0
python-debian 0.1.43+ubuntu1.1
python-gnupg 0.4.8
python-xlib 0.29
pytz 2022.1
pyxdg 0.27
PyYAML 5.4.1
repolib 2.2.1
repoman 1.4.0
requests 2.25.1
s3fs 2024.3.1
scanpy 1.10.1
scikit-learn 1.4.2
scipy 1.13.0
screen-resolution-extra 0.0.0
seaborn 0.13.2
SecretStorage 3.3.1
session-info 1.0.0
sessioninstaller 0.0.0
setuptools 59.6.0
six 1.16.0
somacore 1.0.10
stack-data 0.6.3
statsmodels 0.14.2
stdlib-list 0.10.0
systemd-python 234
tblib 1.7.0
threadpoolctl 3.5.0
tiledb 0.27.1
tiledb-cloud 0.12.7
tiledbsoma 1.9.5
tqdm 4.66.4
traitlets 5.14.3
typing_extensions 4.11.0
tzdata 2024.1
ubuntu-drivers-common 0.0.0
ufw 0.36.1
umap-learn 0.5.6
urllib3 1.26.5
wadllib 1.3.6
wcwidth 0.2.13
wheel 0.37.1
widgetsnbextension 4.0.10
wrapt 1.16.0
xdg 5
xkit 0.0.0
yarl 1.9.4
zipp 1.0.0
Please try setting 'init_pos="random"'. This usually works for me.
Yep, that seems to resolve it. Known issue with spectral init?
I don't know, actually. I have run into this too for very large datasets (1M cells or more). I will forward this to the RAPIDS team at NVIDIA.
I updated the title so people know how to fix it. I'll leave this open for now. I'll also update UMAP so that it has a new 'auto' default that sets init_pos to 'random' if you have more than 1M cells.
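The proposed 'auto' default could be sketched roughly as below. The threshold of 1M cells comes from the comment above; the function name is hypothetical, and the actual implementation in rapids_singlecell may differ:

```python
def choose_init(n_obs: int, threshold: int = 1_000_000) -> str:
    """Hypothetical sketch of an 'auto' default for init_pos: fall back
    to random initialization for very large datasets, where spectral
    initialization has been observed to misbehave."""
    return "random" if n_obs > threshold else "spectral"

# Usage idea: rsc.tl.umap(adata, init_pos=choose_init(adata.n_obs))
print(choose_init(50_000))     # → spectral
print(choose_init(2_000_000))  # → random
```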
The issue remains when init_pos = 'random' is set.
There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.
Thanks for your quick reply!
Hi, has this been updated?
@Intron7 https://github.com/rapidsai/cuml/issues/6045. Unless this issue is fixed, we should remove init_pos=spectral for all UMAP computations.
For reference, the RAPIDS team is actively working to fix the issue with spectral initialization. The first of two PRs required to fix this has been posted to the RAFT repo here. I cannot make any promises about timeline, but I will be working on the corresponding cuML PR this week.
With https://github.com/rapidsai/raft/pull/2568 merged, the cuML 25.04 nightlies should now have a fix for the underlying spectral initialization issue.
For most use cases, this will be enough to eliminate the "flyaway points" that were mentioned in the initial bug report. In some cases, though, you will still need to increase n_epochs somewhat relative to the value you would select when using UMAP-Learn in order to completely eliminate those flyaways. If n_epochs is high enough that you do not see this issue with random initialization, though, it should definitely be high enough for spectral initialization. Please let us know if you continue to see flyaways with spectral initialization anywhere that you don't see them with random initialization.
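One way to act on that advice programmatically is to re-run the embedding with a larger n_epochs whenever flyaways are detected. The sketch below is library-agnostic (the embedding function is passed in, so it could wrap rsc.tl.umap or umap-learn alike); the helper names and the MAD-based flyaway heuristic are my own, not an official API:

```python
import numpy as np

def has_flyaways(emb: np.ndarray, k: float = 10.0) -> bool:
    """Heuristic: any point farther from the median position than the
    median distance plus k MADs counts as a flyaway."""
    dist = np.linalg.norm(emb - np.median(emb, axis=0), axis=1)
    mad = np.median(np.abs(dist - np.median(dist)))
    return bool((dist > np.median(dist) + k * max(mad, 1e-12)).any())

def embed_until_clean(embed_fn, n_epochs=500, max_tries=3):
    """Call embed_fn(n_epochs), doubling n_epochs until no flyaways
    remain or max_tries is exhausted. embed_fn is expected to return
    an (n_points, 2) array."""
    for _ in range(max_tries):
        emb = embed_fn(n_epochs)
        if not has_flyaways(emb):
            return emb, n_epochs
        n_epochs *= 2
    return emb, n_epochs

# Demonstration with a fake embedder that only "converges" at >= 1000
# epochs, standing in for an actual UMAP call.
rng = np.random.default_rng(1)

def fake_embed(n_epochs):
    emb = rng.normal(0, 1, size=(500, 2))
    if n_epochs < 1000:          # simulate flyaways at low epoch counts
        emb[:3] += 1500.0
    return emb

emb, used_epochs = embed_until_clean(fake_embed, n_epochs=500)
print(used_epochs)  # → 1000
```

In practice you would pass a closure over your AnnData object, e.g. a function that calls rsc.tl.umap with the given n_epochs and returns adata.obsm["X_umap"].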