About kmeans clustering

Open zheng-xing opened this issue 1 year ago • 2 comments

Hi,

First, thanks for such great work and for making it open.

I noticed that in your paper you mention:

  • you can cluster 400 million samples into 1 million clusters within 10 minutes
  • Table 5 lists three cluster counts: 100K, 1M, and 10M

Could you add more details about which particular tools you used for this clustering step? I am very curious, as kmeans can usually only handle a small number of clusters.

Thanks very much.

zheng-xing avatar Apr 28 '24 03:04 zheng-xing

We utilized a cluster of 20 machines, each equipped with 8 V100 GPUs, for parallel hierarchical clustering. Each V100 was responsible for clustering 20 million images into 1 million cluster centroids. Subsequently, we aggregated the centroids from all 20 machines, each contributing 1 million centroids, into a final set of 1 million centroids.

The library employed for this operation was faiss-gpu.

anxiangsir avatar Apr 28 '24 06:04 anxiangsir
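
For readers who want to see the shape of the two-stage scheme described in the reply above, here is a minimal sketch, not the authors' code. It assumes the data is split into shards (the issue's numbers would correspond to 20 shards of 20 million samples each, 400 million in total), that each shard is clustered independently, and that the pooled local centroids are then clustered once more without weights. Whether the original pipeline weighted the centroids is not stated. To stay dependency-free, a tiny pure-Python Lloyd's k-means is swapped in for faiss-gpu, and the names `kmeans` and `two_stage_kmeans`, the strided sharding, and the toy sizes are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means (stand-in for faiss-gpu); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    for _ in range(n_iter):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            # Assign the point to its nearest centroid.
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        for j in range(k):
            if counts[j]:  # keep the old centroid if a cluster goes empty
                centroids[j] = [s / counts[j] for s in sums[j]]
    return centroids


def two_stage_kmeans(points, n_shards, k_local, k_final, seed=0):
    """Stage 1: cluster each shard on its own. Stage 2: cluster the pooled centroids."""
    shards = [points[i::n_shards] for i in range(n_shards)]  # assumed sharding rule
    pooled = []
    for i, shard in enumerate(shards):
        pooled.extend(kmeans(shard, k_local, seed=seed + i))
    return kmeans(pooled, k_final, seed=seed)


# Toy data: four 2-D Gaussian blobs, far smaller than the scale in the issue.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
data = [
    [cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
    for cx, cy in blob_centers
    for _ in range(200)
]
rng.shuffle(data)

final_centroids = two_stage_kmeans(data, n_shards=4, k_local=8, k_final=4)
```

In a real run, stage 1 would execute in parallel, one shard per worker, and only the small centroid arrays would need to be gathered for stage 2, which is what keeps the communication cost low.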

Thank you for sharing. May I ask if this portion of the code can be made open source?

zhangluustb avatar May 23 '24 10:05 zhangluustb

https://github.com/facebookresearch/faiss/blob/main/demos/demo_distributed_kmeans_torch.py

anxiangsir avatar Mar 13 '25 08:03 anxiangsir