Metrics ideas
What metrics might we use to validate clusterings?
Metrics that can be computed from fits to the full dataset (no resampling):
- The ratio of within-cluster to between-cluster sum of squares (WSS/BSS), which measures how tight the clusters are relative to how far apart they sit.
- The likelihood ratio, for model-based clustering results.
- Cluster consistency across different random initializations.
- Cluster consistency across parameter choices.
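To make the WSS/BSS ratio from the list above concrete, here is a minimal sketch in plain Python. The function name `wss_bss_ratio` and the toy data are hypothetical, made up for illustration. It assumes squared Euclidean distance and hard cluster assignments, and is not taken from any particular clustering package.

```python
from collections import defaultdict


def wss_bss_ratio(points, labels):
    """Return WSS / BSS for a hard clustering.

    points: list of equal-length numeric tuples (hypothetical input format).
    labels: one cluster label per point.
    Smaller values mean tighter clusters relative to their separation.
    """
    n = len(points)
    dim = len(points[0])

    # Centroid of the whole dataset.
    grand = [sum(p[d] for p in points) / n for d in range(dim)]

    # Group the points by cluster label.
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)

    wss = 0.0  # within-cluster sum of squares
    bss = 0.0  # between-cluster sum of squares
    for members in groups.values():
        size = len(members)
        centroid = [sum(p[d] for p in members) / size for d in range(dim)]
        wss += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
        bss += size * sum((centroid[d] - grand[d]) ** 2 for d in range(dim))

    # Note: bss is zero when every cluster centroid equals the grand centroid
    # (for example, a single cluster), and this division then fails.
    return wss / bss


# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
tight = wss_bss_ratio(points, [1, 1, 2, 2])  # clusters match the pairs
loose = wss_bss_ratio(points, [1, 2, 1, 2])  # clusters cut across the pairs
print(tight, loose)
```

On this toy data the assignment that matches the two pairs gives a much smaller ratio than the one that cuts across them, which is the behavior we'd want from a tightness metric.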
Metrics that can be computed from a collection of resampling results:
- Cluster consistency across different data subsamplings.
Semi-supervised metrics:
- Cluster agreement with observed external/response variables.
To measure the similarity of two cluster assignments, flexclust::randIndex is easy to use with the results of predict.
I like the idea of using this to measure consistency. However, this is very different from cross-validating or resampling in a modeling context, because we'd be comparing predictions across resamples with each other rather than with a ground truth. It also doesn't quite fit, because the resamples would contain different sets of observations, and the Rand index can only compare two cluster assignments made on the same observations.
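To illustrate the same-observations constraint, here is a rough sketch in plain Python of an unadjusted Rand index, plus one possible workaround: compare two resampled fits only on the observations they share. The workaround is my assumption, not something settled in these notes. The names `rand_index` and `rand_index_on_shared`, and the dict-of-ids input format, are hypothetical. This is not how flexclust::randIndex is implemented.

```python
from itertools import combinations


def rand_index(a, b):
    """Unadjusted Rand index for two label sequences on the SAME observations.

    It is the fraction of observation pairs on which the two clusterings agree:
    both put the pair together, or both put it apart. Label names are irrelevant.
    """
    if len(a) != len(b):
        raise ValueError("both assignments must cover the same observations")
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def rand_index_on_shared(assign_a, assign_b):
    """Compare two fits using only the observation ids present in both.

    assign_a, assign_b: dicts mapping observation id -> cluster label
    (a hypothetical format chosen for this sketch).
    """
    shared = sorted(set(assign_a) & set(assign_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared observations")
    return rand_index([assign_a[k] for k in shared],
                      [assign_b[k] for k in shared])


# Two resamples that overlap only on observations 2 through 5.
fit_a = {1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
fit_b = {2: "p", 3: "q", 4: "q", 5: "p", 6: "p"}
score = rand_index_on_shared(fit_a, fit_b)

# Relabeling the clusters does not change the index.
relabeled = rand_index([1, 1, 2, 2], ["b", "b", "a", "a"])
print(score, relabeled)
```

Another option might be to predict from each resampled fit onto one common set of observations and compare those predictions, which keeps every observation in play. Either way, the score measures agreement between fits, not accuracy against a truth.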
I think this is a bigger thing to think about in the future: cluster consistency as a concept that doesn't quite fit the traditional tuning/metrics paradigm.