
Metrics ideas

Open kbodwin opened this issue 4 years ago • 1 comments

What metrics might we use to validate clusterings?

Metrics that can be computed from a single cluster fit:

  • The ratio of within-cluster to between-cluster sum of squares, which measures relative cluster tightness.
  • The likelihood ratio, for model-based clustering results.
  • Cluster consistency across different randomized initial conditions.
  • Cluster consistency across parameter choices.
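
As a rough illustration of the first bullet, here is a minimal sketch of the within-to-between sum of squares ratio using base R's `kmeans()`. The `iris` data and `k = 3` are assumptions chosen only for the example; nothing here is tidyclust's API.

```r
# Sketch: within-to-between sum of squares ratio from a single k-means fit.
# The data set (iris) and k = 3 are illustrative assumptions.
x <- scale(iris[, 1:4])

set.seed(1)
fit <- kmeans(x, centers = 3, nstart = 10)

# kmeans() stores both quantities on the fitted object
wss <- fit$tot.withinss  # total within-cluster sum of squares
bss <- fit$betweenss     # between-cluster sum of squares

# Smaller values mean clusters are tight relative to their separation
ratio <- wss / bss
ratio
```

Because the two components add up to the total sum of squares, the ratio only makes sense for comparing fits on the same data.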

Metrics that can be computed from a collection of resampling results:

  • Cluster consistency across different data subsamplings.

Semi-supervised metrics:

  • Cluster agreement with observed external/response variables.
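
One way this could look in practice, as a sketch only: cross-tabulate cluster labels against an external variable and summarize the table. Using `iris$Species` as the external variable and purity as the summary are both assumptions made for the example, not choices proposed in this issue.

```r
# Sketch: agreement between clusters and an external variable.
# iris$Species plays the role of the external/response variable (assumption).
x <- scale(iris[, 1:4])

set.seed(1)
fit <- kmeans(x, centers = 3, nstart = 10)

# Rows are clusters, columns are the external classes
agreement <- table(cluster = fit$cluster, species = iris$Species)
agreement

# Purity: share of observations that fall in the majority class of their cluster
purity <- sum(apply(agreement, 1, max)) / sum(agreement)
purity
```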

kbodwin commented Nov 20 '21 00:11

For measuring the similarity of cluster assignments, flexclust::randIndex is easy to use with the results of predict().
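
To make concrete what a Rand index measures, here is a hand-rolled, unadjusted version in base R. It is a sketch for illustration and not the flexclust implementation; `rand_index` is a hypothetical helper name.

```r
# Sketch: unadjusted Rand index between two label vectors on the SAME observations.
# rand_index() is a hypothetical helper, not a flexclust or tidyclust function.
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")   # TRUE where a pair shares a cluster in a
  same_b <- outer(b, b, "==")   # TRUE where a pair shares a cluster in b
  pairs <- upper.tri(same_a)    # count each unordered pair once
  # Share of pairs on which the two clusterings agree
  mean(same_a[pairs] == same_b[pairs])
}

a <- c(1, 1, 2, 2, 3, 3)
b <- c(2, 2, 1, 1, 3, 3)  # same partition as a, labels permuted
rand_index(a, b)          # 1: the index ignores label names

d <- c(1, 1, 1, 2, 2, 2)  # a genuinely different partition
rand_index(a, d)
```

The index only looks at whether pairs of observations are grouped together, so it does not require cluster labels to match across fits.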

I like the idea of using this to measure consistency. However, this is very different from cross-validation or resampling in a modeling context, because we'd be comparing predictions across resamples with each other rather than with a ground truth. It also doesn't quite fit, because the resamples contain different sets of observations, and the Rand Index can only compare two clustering assignments on the same observations.
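
One possible workaround, sketched here under assumptions rather than proposed as the design: fit on two subsamples, then predict both fits on a common set of observations so the two assignments are comparable. Base R's `kmeans()` has no predict method, so `nearest_center` below is a hypothetical stand-in; the data, subsample size, and `k = 3` are also illustrative.

```r
# Sketch: compare two subsample fits on a shared set of observations.
x <- scale(iris[, 1:4])
n <- nrow(x)

# Hypothetical helper: assign each row of newdata to its nearest center
nearest_center <- function(newdata, centers) {
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(newdata, 2, centers[k, ])^2)  # squared Euclidean distance
  })
  max.col(-d, ties.method = "first")            # column with the smallest distance
}

set.seed(1)
idx1 <- sample(n, size = 100)
idx2 <- sample(n, size = 100)
fit1 <- kmeans(x[idx1, ], centers = 3, nstart = 10)
fit2 <- kmeans(x[idx2, ], centers = 3, nstart = 10)

# Both fits now label the same observations (here, the full data)
labels1 <- nearest_center(x, fit1$centers)
labels2 <- nearest_center(x, fit2$centers)

# Unadjusted Rand index from pairwise co-membership
same1 <- outer(labels1, labels1, "==")
same2 <- outer(labels2, labels2, "==")
pairs <- upper.tri(same1)
consistency <- mean(same1[pairs] == same2[pairs])
consistency
```

This only works for methods that can assign new observations to clusters, which is part of why consistency sits awkwardly next to ordinary metrics.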

I think this is a bigger thing to think about in the future: cluster consistency, as a concept, doesn't quite fit the traditional tuning/metrics paradigm.

kbodwin commented Jul 20 '22 08:07