scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

improve sc.pp.neighbors parameter docs

Open wangjiawen2013 opened this issue 7 years ago • 10 comments

Dear, Seurat embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar gene expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’. A KNN graph is first constructed based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). Does Scanpy uses Jaccard similarity to cluster too?

wangjiawen2013 avatar Sep 30 '18 12:09 wangjiawen2013

From my understanding, Seurat uses the Louvain algorithm to maximize modularity on their KNN where edge weights are Jaccard similarity between the relevant nodes' neighbors.

Scanpy is largely similar by default, though the nearest neighbors are found by UMAP's method, and all edge weights are 1. There's a boolean flag to use edge weights if you'd like (use_weights), which uses the weights from the adjacency matrix found in: adata.uns["neighbors"]["connectivities"].

If you'd like to know more, I'd recommend looking at the docs for sc.pp.neighbors, sc.tl.louvain, and the clustering tutorials.

ivirshup avatar Oct 03 '18 05:10 ivirshup

I assume UMAP uses a number of top PCs as input as well, no? And do you know if distances are calculated only on 2 UMAP components, or if more are used?

LuckyMD avatar Oct 03 '18 11:10 LuckyMD

Part of the UMAP computation is creating an approximate KNN graph. Scanpy uses that part of the UMAP package to create it's neighborhood graph. UMAP uses whatever distance metric and feature space the user specifies for finding neighbors. Scanpy uses euclidean distance in PCA space as the default for finding neighbors, but that can be changed.

I hope that's more clear.

ivirshup avatar Oct 04 '18 00:10 ivirshup

So the method='umap' has nothing to do with using UMAP dimensionality reduction as a distance metric, but instead only refers to a KNN-graph implementation? It seems to me that making a KNN-graph from the Euclidean distances should be easy.

LuckyMD avatar Oct 04 '18 08:10 LuckyMD

Yeah, that sounds right.

It's super easy to code, you'd just need to calculate all pairwise distances which is ~O(n^2), and similar for memory. To get around that there's a number of approximate nearest neighbors methods.

I'm not too sure what the assumptions are behind each method though. @falexwolf, any reason in particular you've chosen UMAP's method for the KNN calculation?

ivirshup avatar Oct 05 '18 02:10 ivirshup

I'm not to sure what the assumptions are behind each method though. @falexwolf, any reason in particular you've chosen UMAP's method for the KNN calculation?

It's highly competitive in terms of speed and accuracy with other libraries (https://github.com/erikbern/ann-benchmarks, pynndescent is what umap uses, wasn't available at the time for Scanpy), it's a lot easier to install than everything else, and the result has been shown to harmonize well with UMAP, which I expected would become the canonical way of visualizing things.

falexwolf avatar Oct 05 '18 11:10 falexwolf

Thanks for the explanation! I had no idea this was an important bottleneck. I always assumed all distances are calculated... that shows how little I think about optimization ^^.

On another note, it might be a little confusing to see method='umap' as a parameter. I would have immediately assumed that umap is used as a distance metric. But maybe that's me being too lazy to read as well.

LuckyMD avatar Oct 05 '18 14:10 LuckyMD

Is this documented with this level of clarity anywhere else, other than in this thread?

carmensandoval avatar Mar 02 '23 18:03 carmensandoval

I had the same intuition when coming across methods as "umap".

(https://github.com/erikbern/ann-benchmarks, pynndescent is what umap uses, wasn't available at the time for Scanpy)

Is pynndescent is available now to scanpy? If so, can we update the documentation and "method" parameter options for sc.pp.neighbors? Wording such as "The neighbor search efficiency of this heavily relies on UMAP" could be potentially misleading. Thanks.

Jieran-S avatar Oct 09 '25 14:10 Jieran-S

Good point, method should really have cleaner docs. Anyone up for making a PR?

flying-sheep avatar Oct 10 '25 07:10 flying-sheep