[Usage Question] Labels for "top k" best cluster assignments
Hello,
I am using KShape on time series data, and I have a few general, algorithm-agnostic clustering questions:
- I would like to obtain the top-m best assignments, not just the top-1 as found in labels_. So labels_ would be of size (N x m) instead of (N,), with m <= K.
- Conversely, I would like to obtain the top-m best samples for each cluster, i.e. the m samples most similar to a cluster's centroid. This would be an array of size (K x m), with m <= N.
- To summarize points #1 and #2, I would like to obtain the distance matrix from all samples to all clusters (N x K). This matrix by itself should allow me to compute the quantities desired in #1 and #2. I see there is a dist matrix used in the source code. Is there an easy way to access it through the API without hacking the source code?
Hi,
The labels_ are calculated from a distance matrix by taking the argmin of each row (https://github.com/tslearn-team/tslearn/blob/main/tslearn/clustering/kshape.py#L153). I think you can modify that function to get the top-M instead of the top-1, for example by sorting each row rather than taking only its minimum.
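As a rough sketch of that idea, assuming you already have the (N x K) distance matrix as a NumPy array: the helper names `top_m_assignments` and `top_m_samples` and the toy values below are made up for illustration and are not part of tslearn.

```python
import numpy as np


def top_m_assignments(dist, m):
    """For each sample (row), indices of the m closest clusters, nearest first."""
    return np.argsort(dist, axis=1)[:, :m]


def top_m_samples(dist, m):
    """For each cluster (column), indices of the m closest samples, nearest first."""
    return np.argsort(dist, axis=0)[:m, :].T


# Toy distance matrix with made-up values: N = 4 samples, K = 3 clusters.
dist = np.array([
    [0.1, 0.5, 0.9],
    [0.8, 0.2, 0.4],
    [0.3, 0.7, 0.6],
    [0.9, 0.6, 0.05],
])

labels_top2 = top_m_assignments(dist, m=2)   # shape (N, m)
samples_top2 = top_m_samples(dist, m=2)      # shape (K, m)

# The first column of the top-m assignments matches the usual argmin labels.
assert np.array_equal(labels_top2[:, 0], dist.argmin(axis=1))
```

For large N or K, `np.argpartition` would avoid a full sort, at the cost of having to sort the m selected entries afterwards.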
Thanks for the answer! In terms of API design, should M be a parameter that can be passed to fit() and predict()? And should the distance matrix be accessible to users? I thought this would be a common use case and was surprised that the API did not already support it.
Hi there,
We tend to follow the sklearn API, and typical clustering methods there that rely on cross-distance matrices, e.g. k-means, do not expose this distance matrix.
But I understand that it can be useful at times.