feat: cuVS acceleration for GPU k-means
We currently have a PyTorch-based k-means implementation for computing IVF centroids. This PR accelerates it with cuVS, trading slightly less score improvement per iteration for much faster iterations.
By default this is off, since it is primarily useful for very large datasets where large centroid counts are applicable.
Benchmarking (classic k-means scoring):
- k=16384 clusters
- text2image-10M base set (10M image embeddings, 200 dimensions, float32, cosine distance)
Results: slightly better score at ~1.5x the speed; the speedup grows with the number of centroids.
Easy test script & outputs
```python
import time

import numpy as np

import lance
from lance.cuvs.kmeans import KMeans as KMeansVS
from lance.torch.kmeans import KMeans

# Note: this kind of approach performs quite poorly on random data
# (see https://arxiv.org/abs/2405.18680), so it's only worth testing
# on a real dataset.
ds = lance.dataset("path/to/text2image-dataset")  # other medium-to-large datasets also work
data = np.stack(ds.to_table()["vector"].to_numpy())

CLUSTERS = 16384  # k from the benchmark above
max_iters_base = 10
max_iters_cuvs = 12  # cuVS iterations are much faster, but slightly less precise
metric = "cosine"

# Fit the cuVS-accelerated k-means.
cuvs_start_time = time.time()
kmeans_cuvs = KMeansVS(
    CLUSTERS,
    metric=metric,
    max_iters=max_iters_cuvs,
    seed=0,
)
kmeans_cuvs.fit(data)
cuvs_end_time = time.time()

# Fit the baseline PyTorch k-means.
base_start_time = time.time()
kmeans = KMeans(
    CLUSTERS,
    metric=metric,
    max_iters=max_iters_base,
    seed=0,
)
kmeans.fit(data)
base_end_time = time.time()

print(
    f"score after {max_iters_cuvs} iters of kmeans_cuvs better than "
    f"{max_iters_base} iters of kmeans by "
    f"{kmeans.total_distance - kmeans_cuvs.total_distance}"
)
base_time = base_end_time - base_start_time
cuvs_time = cuvs_end_time - cuvs_start_time
print(
    f"time to run kmeans: {base_time}s. "
    f"time to run kmeans_cuvs: {cuvs_time}s "
    f"(speedup: {base_time / cuvs_time}x)"
)
```
Output:
```
score after 12 iters of kmeans_cuvs better than 10 iters of kmeans by 5905.7138671875
time to run kmeans: 86.69116258621216s. time to run kmeans_cuvs: 56.66267776489258s (speedup: 1.5299517425899842x)
```
Additionally, a new "accelerator" choice has been added: "cuvs". This requires one of the new optional dependencies (cuvs-py3X, for X in {9, 10, 11}). It can replace the two routines for which we already have CUDA acceleration: IVF model training (Lloyd's algorithm) and IVF assignments. At sufficiently large centroid counts this significantly accelerates both steps, resulting in better e2e build time. See below:
Although these plots are near-identical, the "cuvs"-accelerated variant took ~18.1s to build e2e, while the "cuda"-accelerated variant took ~24.4s.
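A minimal usage sketch, assuming the new value is passed through the existing `accelerator` parameter of `create_index` the same way `"cuda"` is today, and that the new extras are exposed on the pylance package (e.g. `pip install 'pylance[cuvs-py310]'` on Python 3.10); the index parameters here are illustrative, not values from this PR:

```python
import lance

ds = lance.dataset("path/to/text2image-dataset")

# Hypothetical sketch: num_partitions/num_sub_vectors are illustrative
# choices for a 200-dimensional dataset, not prescribed by this PR.
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    metric="cosine",
    num_partitions=16384,  # large centroid counts are where cuVS helps most
    num_sub_vectors=8,     # 200 dims / 8 sub-vectors = 25 dims each
    accelerator="cuvs",    # new choice; previously "cuda" covered these steps
)
```

The accelerator choice only affects the two routines above (IVF training and IVF assignments); the rest of the index build is unchanged.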
This speedup persists on larger datasets, although I was mistaken: as the dataset gets larger, PQ assignments are the bigger bottleneck (thanks to some improvements I had not seen), so IVF training is not the bottleneck step. The next step after this PR will be to accelerate PQ with both CUDA and cuVS.