Spherical K-means support (unit norm centroids and input)

Open Radu1999 opened this issue 8 months ago • 10 comments

Describe the workflow you want to enable

Hi, I was wondering if there is—or has been—any initiative to support cosine similarity in the KMeans implementation (i.e., spherical KMeans). I find the algorithm quite useful and would be happy to propose an implementation. The addition should be relatively straightforward.

Describe your proposed solution

Enable the use of cosine similarity with KMeans or implement a separate SphericalKMeans class.
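
To make the proposal concrete, here is a minimal, dependency-free sketch of what spherical k-means does, assuming plain Python lists as input. The names `spherical_kmeans`, `normalize`, and `dot` are hypothetical helpers for illustration only and are not part of scikit-learn. Inputs are scaled to unit norm, points are assigned by largest dot product (which equals cosine similarity on unit vectors), and each centroid is re-projected onto the unit sphere after the update.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(X, n_clusters, n_iter=20, seed=0):
    """Illustrative spherical k-means (hypothetical helper, not a scikit-learn API)."""
    rng = random.Random(seed)
    X = [normalize(row) for row in X]
    # Initialize centroids from randomly chosen (already normalized) points.
    centers = [list(c) for c in rng.sample(X, n_clusters)]
    labels = [0] * len(X)
    for _ in range(n_iter):
        # Assignment step: on unit vectors, cosine similarity is the dot product.
        labels = [
            max(range(n_clusters), key=lambda k: dot(x, centers[k])) for x in X
        ]
        # Update step: sum the members, then project back onto the unit sphere.
        for k in range(n_clusters):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                summed = [sum(col) for col in zip(*members)]
                centers[k] = normalize(summed)
    return labels, centers


if __name__ == "__main__":
    # Two directions with very different magnitudes.
    data = [[1.0, 0.0], [2.0, 0.1], [0.0, 3.0], [0.1, 5.0]]
    labels, centers = spherical_kmeans(data, n_clusters=2)
    print(labels)
    print(centers)
```

The sketch runs a fixed number of iterations and uses naive random initialization; a real implementation would need a convergence check, a better initialization, and handling for members that sum to the zero vector.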

Describe alternatives you've considered, if relevant

No response

Additional context

No response

Radu1999 avatar May 28 '25 20:05 Radu1999

Hey, I'd like to take this up @Radu1999

Namit24 avatar Jun 01 '25 10:06 Namit24

> Hey, I'd like to take this up @Radu1999

No, it's OK. I was planning to implement it once I confirm there is interest in it.

Radu1999 avatar Jun 02 '25 11:06 Radu1999

Fair, go for it then.

Namit24 avatar Jun 02 '25 11:06 Namit24

@scikit-learn/core-devs any interest in this?

@Radu1999 to help evaluate this, could you provide some references and context that help answer the questions from https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms ?

betatim avatar Jun 04 '25 13:06 betatim

+1

rohnsha0 avatar Jun 06 '25 05:06 rohnsha0

> @Radu1999 to help evaluate this, could you provide some references and context that help answer the questions from https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms ?

@betatim the paper has more than 200 citations and was published in 2012. IMO it excels at clustering normalized, directional data (like text), where vector direction matters more than magnitude.

rohnsha0 avatar Jun 06 '25 05:06 rohnsha0

@betatim I'd like to take this!

rohnsha0 avatar Jun 06 '25 05:06 rohnsha0

> @scikit-learn/core-devs any interest in this?

I'm not sure that I would consider this a priority.

GaelVaroquaux avatar Jun 10 '25 19:06 GaelVaroquaux

I'd say with a small maintainable implementation, I'd be happy to have it.

adrinjalali avatar Jun 11 '25 12:06 adrinjalali

I'd suggest adding a configurable distance metric, with 'euclidean' as the default, to the existing KMeans rather than implementing a separate class, just like here: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/clustering.html#KMeans
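
As a rough illustration of the configurable-metric idea, the sketch below switches the assignment step on a `metric` argument. Both `assign_labels` and its `metric` parameter are hypothetical names chosen for this example; scikit-learn's `KMeans` does not currently accept such an argument.

```python
import math


def assign_labels(X, centers, metric="euclidean"):
    """Assign each point to its best centroid under the chosen metric.

    Hypothetical helper for illustration; `metric` is an assumed parameter
    name, not an existing scikit-learn option.
    """
    if metric == "euclidean":
        def score(x, c):
            # Higher score is better, so negate the distance.
            return -math.dist(x, c)
    elif metric == "cosine":
        def score(x, c):
            denom = math.hypot(*x) * math.hypot(*c)
            # Treat similarity to a zero vector as 0 to avoid dividing by zero.
            return sum(a * b for a, b in zip(x, c)) / denom if denom else 0.0
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    return [
        max(range(len(centers)), key=lambda k: score(x, centers[k])) for x in X
    ]


if __name__ == "__main__":
    points = [[10.0, 1.0]]
    centers = [[1.0, 0.0], [8.0, 8.0]]
    print(assign_labels(points, centers, metric="euclidean"))
    print(assign_labels(points, centers, metric="cosine"))
```

Only the assignment step is shown. With a cosine metric the centroid update also changes, since centroids are typically re-normalized to unit length after averaging.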

Radu1999 avatar Jun 11 '25 12:06 Radu1999