
[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent?

flying-sheep opened this issue 2 years ago • 2 comments

We’ve been thinking about checking whether Scanpy’s neighbors backends can benefit from the shortcut, but it seems like PyNNDescent behaves differently: its sparse matrix has (n_neighbors - 1) * n_obs non-zero entries instead of n_neighbors * n_obs.

Shouldn’t the two match up in behavior? Is this a cupy bug? I’m unsure where to harmonize this.

https://github.com/scverse/rapids_singlecell/blob/b038bcc923b4dd3d4874bae681b0f2cb919bccbf/src/rapids_singlecell/preprocessing/_neighbors.py#L131-L132

https://github.com/scverse/scanpy/blob/616d580384c1daa8115a270220aafd8c09f9f0f1/scanpy/neighbors/_common.py#L64-L81

flying-sheep avatar Oct 12 '23 09:10 flying-sheep

This is not a bug but intended. cuML behaves like scikit-learn. When I started working on neighbors, I also went for (n_neighbors - 1) * n_obs, but for rsc, n_neighbors * n_obs is important for the UMAP shortcut. I could think about adding the zero-distance self entries in UMAP to mimic PyNNDescent.

Intron7 avatar Oct 12 '23 09:10 Intron7

No need, I just wanted to start the discussion about what is “correct”.

Seems like this isn’t so easy to answer, see relevant note from that linked document:

> **Note**
>
> When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of n_neighbors is ambiguous since it can either include each training point as its own neighbor, or exclude them. Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and testing, while excluding them leads to a difference between fit(X).transform(X) and fit_transform(X), which is against scikit-learn API. In KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count of n_neighbors. However, for compatibility reasons with other estimators which use the other definition, one extra neighbor will be computed when mode == 'distance'. To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.
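The ambiguity described in that note can be made concrete with a tiny brute-force example in plain NumPy (not any of the libraries discussed here): the two definitions differ only in whether column 0 of each sorted neighbor list, which is always the point itself at distance 0, counts toward n_neighbors.

```python
import numpy as np

# Four 1-D points; pairwise distances via broadcasting
X = np.array([[0.0], [1.0], [3.0], [6.0]])
D = np.abs(X - X.T)

# Sorting each row puts the point itself (distance 0) in column 0
order = np.argsort(D, axis=1, kind="stable")

n_neighbors = 2
incl_self = order[:, :n_neighbors]       # sklearn/cuML reading: self counts
excl_self = order[:, 1:n_neighbors + 1]  # PyNNDescent-style reading: self dropped
```

Both results have n_neighbors columns, but `incl_self` spends one of them on the point itself, which is exactly the off-by-one under discussion.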

I assume “one extra neighbor” means “per cell, calculate n_neighbors neighbors that aren’t the same cell”, resulting in a sparse matrix with n_neighbors non-zero entries per row. Our code then creates knn_* matrices with n_neighbors columns that include that “additional” neighbor, but not the implicit, zero-distance, “self” neighbor.
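To illustrate why the uniform count matters for the shortcut, here is a minimal sketch (not the actual rsc code; `sparse_to_knn` is a hypothetical helper name) of turning a CSR kNN distance matrix into dense index/distance matrices. The constant-shape reshape only works when every row stores exactly n_neighbors entries:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_to_knn(dists: csr_matrix, n_neighbors: int):
    """Reshape a CSR kNN graph into (indices, distances) matrices.

    Assumes every row holds exactly ``n_neighbors`` non-zero entries;
    that assumption is what makes this reshape shortcut valid.
    """
    knn_ind = dists.indices.reshape(dists.shape[0], n_neighbors)
    knn_dist = dists.data.reshape(dists.shape[0], n_neighbors)
    return knn_ind, knn_dist

# Toy graph: 3 cells, 2 stored neighbors each (self not stored)
rows = np.repeat(np.arange(3), 2)
cols = np.array([1, 2, 0, 2, 0, 1])
data = np.array([0.5, 1.0, 0.5, 0.7, 1.0, 0.7])
g = csr_matrix((data, (rows, cols)), shape=(3, 3))

ind, dist = sparse_to_knn(g, 2)
```

With only n_neighbors - 1 entries per row, the same reshape would either fail or silently misalign rows.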

So if PyNNDescent has n_neighbors - 1 entries per row in the sparse matrix, that means it doesn’t follow the sklearn advice and interprets n_neighbors as “number of neighbors including the same cell”. So using your code on PyNNDescent’s output returns two matrices with n_neighbors - 1 columns each.
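If one did want to harmonize the two conventions, one option is to pad PyNNDescent-style output back to n_neighbors columns by prepending each cell as its own zero-distance neighbor. A hedged sketch (`add_self_neighbor` is a hypothetical name, not an API from any of these libraries):

```python
import numpy as np

def add_self_neighbor(knn_ind: np.ndarray, knn_dist: np.ndarray):
    """Prepend each cell as its own zero-distance neighbor,
    turning (n_neighbors - 1)-column matrices into n_neighbors columns."""
    n_obs = knn_ind.shape[0]
    self_col = np.arange(n_obs)[:, None]
    ind = np.hstack([self_col, knn_ind])
    dist = np.hstack([np.zeros((n_obs, 1)), knn_dist])
    return ind, dist

# PyNNDescent-style matrices for n_neighbors = 3: only 2 non-self columns
ind = np.array([[1, 2], [0, 2], [0, 1]])
dist = np.array([[0.5, 1.0], [0.5, 0.7], [1.0, 0.7]])

ind3, dist3 = add_self_neighbor(ind, dist)
```

This mirrors the direction Intron7 mentions above (adding the zeros on the UMAP side), just applied to the index/distance pair instead.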

cc @eroell @ivirshup

flying-sheep avatar Oct 12 '23 11:10 flying-sheep