scikit-learn Spherical K-means support (unit norm centroids and input)

Describe the workflow you want to enable

Hi, I was wondering if there is—or has been—any initiative to support cosine similarity in the KMeans implementation (i.e., spherical KMeans). I find the algorithm quite useful and would be happy to propose an implementation. The addition should be relatively straightforward.

Describe your proposed solution

Enable the use of cosine similarity with KMeans or implement a separate SphericalKMeans class.
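To make the request concrete, here is a minimal NumPy sketch of what spherical k-means could look like. It is only an illustration, not a proposed scikit-learn API: the function name `spherical_kmeans` and its parameters are hypothetical, initialisation is plain random sampling, and inputs are assumed to contain no all-zero rows.

```python
import numpy as np


def spherical_kmeans(X, n_clusters, n_iter=50, seed=0):
    """Toy spherical k-means: cosine-similarity assignment, unit-norm centroids.

    Hypothetical helper for illustration only; not part of scikit-learn.
    Assumes X has no all-zero rows.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Project every sample onto the unit sphere.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialise centroids from randomly chosen (already normalised) samples.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # On the unit sphere, cosine similarity is just a dot product.
        new_labels = np.argmax(X @ centers.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so we have converged
        labels = new_labels
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # Renormalise so the centroid stays on the unit sphere.
                centers[k] = total / norm
    return labels, centers
```

The only differences from the usual Lloyd iteration in this sketch are the input normalisation, the dot-product assignment, and the renormalisation of each centroid after the mean step.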

Describe alternatives you've considered, if relevant

No response

Additional context

No response

May 28 '25 20:05 Radu1999

Hey I'd like to take up this @Radu1999

Jun 01 '25 10:06 Namit24

> Hey I'd like to take up this @Radu1999

No, it's ok, I was planning to implement it once I confirm there is interest for it.

Jun 02 '25 11:06 Radu1999

Fair, go for it then.

Jun 02 '25 11:06 Namit24

@scikit-learn/core-devs any interest in this?

@Radu1999 to help evaluate this, could you provide some references and context that helps answer the questions from https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms

Jun 04 '25 13:06 betatim

+1

Jun 06 '25 05:06 rohnsha0

> @Radu1999 to help evaluate this, could you provide some references and context that helps answer the questions from https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms

@betatim the paper has more than 200 citations and was published in 2012. IMO the algorithm excels at clustering normalized, directional data (like text), where vector direction matters more than magnitude.
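As a rough illustration of the "direction over magnitude" point, the NumPy snippet below checks two facts: cosine similarity does not change when a vector is rescaled, and for unit-norm vectors the squared Euclidean distance equals 2 - 2 * cosine similarity. The helper names are made up for this example.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(v):
    """Scale a non-zero vector to unit L2 norm."""
    return v / np.linalg.norm(v)


rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 5))

# Rescaling a vector changes Euclidean distances but not the cosine.
same_direction = np.isclose(cosine_similarity(a, b), cosine_similarity(10 * a, b))

# On the unit sphere: ||a - b||^2 = 2 - 2 * cos(a, b).
sq_dist = float(np.sum((unit(a) - unit(b)) ** 2))
identity_holds = np.isclose(sq_dist, 2 - 2 * cosine_similarity(a, b))

print(same_direction, identity_holds)
```

So once the inputs are normalised, the nearest centroid under Euclidean distance is also the most cosine-similar one, provided the centroids are unit-norm as well. That centroid renormalisation is the step plain KMeans does not do.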

Jun 06 '25 05:06 rohnsha0

@betatim I'd like to take this!

Jun 06 '25 05:06 rohnsha0

> @scikit-learn/core-devs any interest in this?

I'm not sure that I would consider this a priority.

Jun 10 '25 19:06 GaelVaroquaux

I'd say with a small maintainable implementation, I'd be happy to have it.

Jun 11 '25 12:06 adrinjalali

I'd suggest adding a configurable distance metric, with 'euclidean' as the default, to the existing KMeans rather than implementing a separate class, just like here: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/clustering.html#KMeans
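Roughly, the assignment step could dispatch on the metric as in the sketch below. This is only an assumption about how it might look: `assign_labels` and its `metric` argument are hypothetical names, not existing scikit-learn API, and the cosine branch assumes no zero vectors.

```python
import numpy as np


def assign_labels(X, centers, metric="euclidean"):
    """Hypothetical assignment step with a switchable metric (not scikit-learn API)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    if metric == "euclidean":
        # Squared Euclidean distance from every sample to every centroid.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    elif metric == "cosine":
        # Cosine distance = 1 - cosine similarity; assumes no zero vectors.
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
        dist = 1.0 - Xn @ Cn.T
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    return dist.argmin(axis=1)


centers = np.array([[1.0, 0.0], [5.0, 5.0]])
point = np.array([[6.0, 1.0]])
print(assign_labels(point, centers, metric="euclidean"))  # closer in space to [5, 5]
print(assign_labels(point, centers, metric="cosine"))     # closer in angle to [1, 0]
```

The centroid update would also need to depend on the metric (renormalising centroids in the cosine case), so the change is a bit more than swapping the distance function.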

Jun 11 '25 12:06 Radu1999