[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent?
We’ve been thinking about checking if Scanpy’s neighbors backends can benefit from the shortcut, but it seems like PyNNDescent behaves differently ((n_neighbors - 1) * n_obs instead of n_neighbors * n_obs)
Shouldn’t the two match up in behavior? Is this a cupy bug? I’m unsure where to harmonize this.
https://github.com/scverse/rapids_singlecell/blob/b038bcc923b4dd3d4874bae681b0f2cb919bccbf/src/rapids_singlecell/preprocessing/_neighbors.py#L131-L132
https://github.com/scverse/scanpy/blob/616d580384c1daa8115a270220aafd8c09f9f0f1/scanpy/neighbors/_common.py#L64-L81
This is not a bug but intended. cuML behaves like scikit-learn. When I started working on neighbors I also went for (n_neighbors - 1) * n_obs. But for rsc n_neighbors * n_obs is important for the umap shortcut. I can think about adding in the 0 zeros in UMAP and mimic PyNNDescent.
No need, I just wanted to start the discussion about what is “correct”.
Seems like this isn’t so easy to answer, see relevant note from that linked document:
Note
When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of
n_neighborsis ambiguous since it can either include each training point as its own neighbor, or exclude them. Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and testing, while excluding them leads to a difference betweenfit(X).transform(X)andfit_transform(X), which is against scikit-learn API. In KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count ofn_neighbors. However, for compatibility reasons with other estimators which use the other definition, one extra neighbor will be computed whenmode == 'distance'. To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.
I assume “one extra neighbor” means “per cell, calculate n_neighbors neighbors that aren’t the same cell”, resulting in a sparse matrix with n_neighbors non-zero entries per row. Our code then creates knn_* matrices with n_neighbors columns that include that “additional” neighbor, but not the implicit, zero-distance, “self” neighbor.
So if PyNNDescent has n_neighbors - 1 entries per row in the sparse matrix, that means it doesn’t follow the sklearn advice and interprets n_neighbors as “number of neighbors including the same cell”. So using your code on PyNNDescent’s output returns two matrices with n_neighbors - 1 columns each.
cc @eroell @ivirshup