
[Usage Question] Labels for "top k" best cluster assignments

Open yutaizhou opened this issue 4 years ago • 3 comments

Hello,

I am using kshape for time series data, and here are some general algorithm-agnostic clustering questions I have:

  1. I would like to obtain the top-m best assignments, not just the top-1 as found in labels_. So labels_ would be of size (N x m) instead of (N,), with m <= K.

  2. Conversely, I would like to obtain the top-m best samples for each cluster, i.e. the m samples most similar to a cluster's centroid. This would be an array of size (K x m), with m <= N.

  3. To sum up points #1 and #2, I would like to obtain the distance matrix from all samples to all cluster centroids (N x K). This matrix by itself should allow me to compute the quantities desired in #1 and #2. I see there is a dist matrix used in the source code. Is there an easy way to access it through the API without hacking the source code?
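For point #2, a minimal sketch of what the ranking could look like once an (N x K) distance matrix is in hand. The `dists` array here is made-up toy data, and `top_m_samples_per_cluster` is a hypothetical helper, not part of the tslearn API. It only assumes that smaller values mean "closer to the centroid".

```python
import numpy as np


def top_m_samples_per_cluster(dists, m):
    """Return a (K, m) array of sample indices, nearest to each centroid first.

    dists: (N, K) array, dists[i, k] = distance from sample i to centroid k.
    Hypothetical helper for illustration; not a tslearn function.
    """
    n_samples, n_clusters = dists.shape
    if not 1 <= m <= n_samples:
        raise ValueError("m must satisfy 1 <= m <= N")
    # Sort each column (one column per cluster) by increasing distance.
    order = np.argsort(dists, axis=0, kind="stable")  # shape (N, K)
    # Keep the m nearest samples and put clusters on the first axis.
    return order[:m].T  # shape (K, m)


# Toy distance matrix: N = 4 samples, K = 2 clusters (made-up values).
dists = np.array([[0.2, 0.9],
                  [0.7, 0.1],
                  [0.4, 0.6],
                  [0.1, 0.8]])
closest = top_m_samples_per_cluster(dists, m=2)
```

Note that this ranks all N samples against every centroid, whether or not a sample is assigned to that cluster. To restrict the ranking to a cluster's own members, one option would be to set the distances of non-members to `np.inf` before sorting.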

yutaizhou, Mar 08 '21 19:03

Hi,

The labels_ are calculated from a distance matrix by taking the argmin for each row (https://github.com/tslearn-team/tslearn/blob/main/tslearn/clustering/kshape.py#L153). I think you can modify that function to get a top-M instead of top-1.
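As a rough illustration of that change, here is a sketch that swaps the row-wise argmin for a row-wise sort. It works on a plain NumPy array standing in for the distance matrix; `top_m_labels` is a hypothetical name, and how the matrix is obtained from a fitted model is left out because the public API does not expose it.

```python
import numpy as np


def top_m_labels(dists, m):
    """Return an (N, m) array of cluster indices, nearest cluster first.

    dists: (N, K) array, dists[i, k] = distance from sample i to centroid k.
    Hypothetical helper for illustration; not a tslearn function.
    """
    n_samples, n_clusters = dists.shape
    if not 1 <= m <= n_clusters:
        raise ValueError("m must satisfy 1 <= m <= K")
    # A stable sort keeps the lowest cluster index first among ties,
    # so column 0 matches what dists.argmin(axis=1) would return.
    return np.argsort(dists, axis=1, kind="stable")[:, :m]


# Toy distance matrix: N = 3 samples, K = 3 clusters (made-up values).
dists = np.array([[0.2, 0.9, 0.5],
                  [0.7, 0.1, 0.4],
                  [0.6, 0.3, 0.3]])
labels_top2 = top_m_labels(dists, m=2)
```

With m = 1 this reduces to the usual top-1 labels, so the first column can be checked against `dists.argmin(axis=1)`.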

GillesVandewiele, Mar 09 '21 07:03

Thanks for the answer! API design-wise, should M be a parameter that can be passed into fit() and predict()? And should the distance matrix be accessible to users? I thought this would be a common use case and was surprised that the API did not support it already.

yutaizhou, Mar 09 '21 13:03

Hi there,

We tend to follow the sklearn API, and typical clustering methods there that rely on cross-distance matrices, e.g. k-means, do not disclose this distance matrix.

But I understand that it can be useful at times.

rtavenar, May 17 '21 09:05