Metrics ideas

Open kbodwin opened this issue 4 years ago • 1 comments

What metrics might we use to validate clusterings?

Metrics that can be computed from a single cluster fit:

The ratio of within-cluster sum of squares to between-cluster sum of squares, which measures how tight the clusters are relative to how far apart they sit.
The likelihood ratio, for model-based clustering results.
Cluster consistency across different randomized initial conditions.
Cluster consistency across parameter choices.
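
As a rough illustration of the first metric in this list, here is a minimal sketch of the within/between sum of squares ratio for a hard clustering. It is plain Python rather than any package's API, and the function name `wss_bss_ratio` and its arguments are hypothetical. Smaller values mean tighter, better separated clusters.

```python
def wss_bss_ratio(points, labels):
    """Return WSS / BSS for a hard clustering.

    points: list of equal-length numeric tuples (hypothetical input format).
    labels: one cluster label per point.
    """
    n = len(points)
    dim = len(points[0])
    grand_mean = [sum(p[d] for p in points) / n for d in range(dim)]

    # Group the points by their cluster label.
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)

    wss = 0.0  # squared distances from points to their own centroid
    bss = 0.0  # size-weighted squared distances from centroids to the grand mean
    for members in groups.values():
        size = len(members)
        centroid = [sum(p[d] for p in members) / size for d in range(dim)]
        wss += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members
        )
        bss += size * sum((centroid[d] - grand_mean[d]) ** 2 for d in range(dim))

    if bss == 0:
        # Happens with a single cluster or identical centroids.
        raise ValueError("between sum of squares is zero; ratio is undefined")
    return wss / bss
```

For example, four points at the corners of a long, thin rectangle give a small ratio when split by the long axis and a large one when split by the short axis.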

Metrics that can be computed from a collection of resampling results:

Cluster consistency across different data subsamplings.

Semi-supervised metrics:

Cluster agreement with observed external/response variables.

Nov 20 '21 00:11 kbodwin

For clustering assignment similarities, flexclust::randIndex is easy to use with the results of predict.

I like the idea of using this to measure consistency. However, this is very different from cross-validation or resampling in a modeling context, because we'd be comparing predictions across resamples with each other rather than with a ground truth. It also doesn't quite fit mechanically: the resamples contain different sets of observations, and the Rand Index can only compare two clustering assignments made on the same observations.
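
To make the "same observations" constraint concrete, here is a minimal sketch of the plain (unadjusted) Rand Index, plus one possible workaround that compares two subsample fits only on the observations they share. The overlap idea is an assumption for illustration, not something settled in this thread, and the names `rand_index` and `rand_index_on_overlap` are hypothetical. This is not a reimplementation of flexclust::randIndex.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Plain Rand Index: fraction of observation pairs on which two
    clusterings agree (both put the pair together, or both split it)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same observations")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def rand_index_on_overlap(assign_a, assign_b):
    """Compare two subsample clusterings on their shared observations only.

    assign_a, assign_b: dicts mapping observation id -> cluster label
    (a hypothetical representation of each resample's assignments).
    """
    shared = sorted(set(assign_a) & set(assign_b))
    return rand_index(
        [assign_a[k] for k in shared],
        [assign_b[k] for k in shared],
    )
```

Cluster labels do not need to match across the two fits, since only pairwise co-membership is compared. If two subsamples share fewer than two observations, the comparison is undefined and the sketch raises an error.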

I think this is a bigger thing to think about in the future: cluster consistency is a concept that doesn't quite fit the traditional tuning/metrics paradigm.

Jul 20 '22 08:07 kbodwin