[BUG] `sample_rows` + balanced k-means leads to imbalanced clusters on BIGANN 1B
Multiple routines call raft::matrix::sample_rows() followed by a balanced cuvs::cluster::kmeans::fit(), including all_neighbors::get_centroids_on_data_subsample(), ivf_pq::build(), scann::build(), and ACE (introduced in #1404). Testing this PR on BIGANN 1B with a 1% (10M) sample shows highly imbalanced partitions:
Primary vectors - Total: 1000000000, Avg: 1000000.0, Min: 160947, Max: 18829503
Augmented vectors - Total: 1000000000, Avg: 1000000.0, Min: 153915, Max: 13578909
Total per partition - Total: 2000000000, Avg: 2000000.0, Min: 323707, Max: 32408412
This can lead to OOM issues in partitioned approaches.
Uniform sampling (see cagra::ace_get_partition_labels introduced in #1404) shows much better balancing:
Primary vectors - Total: 1000000000, Avg: 1000000.0, Min: 519219, Max: 3040985
Augmented vectors - Total: 1000000000, Avg: 1000000.0, Min: 265749, Max: 2634495
Total per partition - Total: 2000000000, Avg: 2000000.0, Min: 784968, Max: 5378950
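
To make the comparison easier to read, the imbalance can be expressed as max / avg per partition. From the numbers above, the largest primary partition is roughly 18.8x the average with `sample_rows`, versus about 3.0x with uniform sampling; for the per-partition totals it is about 16.2x versus 2.7x. Below is a minimal sketch of that calculation. The names `PartitionStats`, `summarize`, and `imbalance_factor` are hypothetical helpers for illustration only and are not part of cuVS or RAFT.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper type (not a cuVS API): summary of partition sizes.
struct PartitionStats {
  std::uint64_t total;
  std::uint64_t min;
  std::uint64_t max;
  double avg;
  double imbalance;  // max / avg; 1.0 means perfectly balanced
};

// Imbalance factor from values already printed in a log line.
double imbalance_factor(std::uint64_t max_size, double avg_size)
{
  assert(avg_size > 0.0);
  return static_cast<double>(max_size) / avg_size;
}

// Compute total / min / max / avg / imbalance from raw per-partition counts.
PartitionStats summarize(const std::vector<std::uint64_t>& sizes)
{
  assert(!sizes.empty());
  const auto [min_it, max_it] = std::minmax_element(sizes.begin(), sizes.end());
  const std::uint64_t total =
    std::accumulate(sizes.begin(), sizes.end(), std::uint64_t{0});
  const double avg = static_cast<double>(total) / static_cast<double>(sizes.size());
  return PartitionStats{total, *min_it, *max_it, avg, imbalance_factor(*max_it, avg)};
}
```

Calling `imbalance_factor(18829503, 1000000.0)` and `imbalance_factor(3040985, 1000000.0)` reproduces the primary-vector ratios quoted above. In a partitioned build, peak memory is driven by the largest partition, so this ratio is a rough proxy for how much headroom over the average partition size is needed.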