
feat: cuvs acceleration for gpu k-means


We currently have a PyTorch-based k-means implementation for computing IVF centroids. This PR accelerates it with cuVS, trading slightly less score improvement per iteration for much faster iterations.

By default, this is off, since it's primarily useful for very large datasets where large centroid counts are applicable.

Benchmarking (classic k-means scoring):

  • k=16384 clusters
  • text2image-10M base set (10M image embeddings, 200 dimensions, float32, cosine distance)

Results: slightly better score at ~1.5x the speed. The speedup improves with larger centroid counts. (The score metric is sketched below.)
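For reference, the "classic k-means scoring" above is the total distance from each point to its nearest centroid (lower is better). A minimal numpy sketch, assuming cosine distance over L2-normalized rows; the helper name and batching are mine for illustration, not part of Lance:

import numpy as np

def kmeans_score(data, centroids, batch=65536):
    """Sum of each point's cosine distance to its nearest centroid.

    Assumes rows of `data` and `centroids` are L2-normalized, so cosine
    distance reduces to 1 - dot product. Batched so the distance matrix
    stays small even for millions of points.
    """
    total = 0.0
    for i in range(0, len(data), batch):
        # (batch, n_centroids) cosine distances for this slice of points
        dists = 1.0 - data[i:i + batch] @ centroids.T
        total += float(dists.min(axis=1).sum())
    return total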

Easy test script & outputs
import time

import numpy as np

import lance
from lance.cuvs.kmeans import KMeans as KMeansVS
from lance.torch.kmeans import KMeans

# Note: this kind of approach performs quite poorly on random data
# (see https://arxiv.org/abs/2405.18680), so it's only worth testing on a real dataset.
ds = lance.dataset("path/to/text2image-dataset")  # other medium~large datasets also work
data = np.stack(ds.to_table()["vector"].to_numpy())

CLUSTERS = 16384  # k = 16384, matching the benchmark above
max_iters_base = 10
max_iters_cuvs = 12  # cuvs iterations are much faster, but slightly less precise
metric = "cosine"

# fit the cuVS-accelerated k-means and time it
cuvs_start_time = time.time()
kmeans_cuvs = KMeansVS(
    CLUSTERS,
    metric=metric,
    max_iters=max_iters_cuvs,
    seed=0,
)
kmeans_cuvs.fit(data)
cuvs_end_time = time.time()

# fit the baseline pytorch k-means and time it
base_start_time = time.time()
kmeans = KMeans(
    CLUSTERS,
    metric=metric,
    max_iters=max_iters_base,
    seed=0,
)
kmeans.fit(data)
base_end_time = time.time()

# compare final scores (total distance to nearest centroid; lower is better) and wall time
print(f"score after {max_iters_cuvs} iters of kmeans_cuvs better than {max_iters_base} iters of kmeans by {kmeans.total_distance - kmeans_cuvs.total_distance}")
base_time = base_end_time - base_start_time
cuvs_time = cuvs_end_time - cuvs_start_time
print(f"time to run kmeans: {base_time}s. time to run kmeans_cuvs: {cuvs_time} (speedup: {base_time/cuvs_time}x)")

Output:

score after 12 iters of kmeans_cuvs better than 10 iters of kmeans by 5905.7138671875
time to run kmeans: 86.69116258621216s. time to run kmeans_cuvs: 56.66267776489258 (speedup: 1.5299517425899842x)

Additionally, a new "accelerator" choice has been added: "cuvs". It requires one of the new optional dependencies (cuvs-py3X, X in {9,10,11}). It can replace the two routines that already have CUDA acceleration: IVF model training (Lloyd's algorithm) and IVF assignment. At sufficiently large centroid counts, this significantly accelerates both steps, improving end-to-end build time. See below:

[Figure: recall/QPS plots for dataset sift1m, k=10 (results_static_20240906_132311)]

Although these plots are near-identical, the "cuvs"-accelerated variant took ~18.1s to build end-to-end, while the "cuda"-accelerated variant took ~24.4s.

This speedup persists on larger datasets, although I was mistaken about where the bottleneck lies: thanks to some improvements I had not seen, PQ assignment becomes the bigger bottleneck as the dataset grows, so IVF training is no longer the bottleneck step. The next step after this PR will be to accelerate PQ with both cuda and cuvs.
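For completeness, here is a hedged usage sketch of the new accelerator option via create_index; the IVF/PQ parameter values are illustrative only, not a recommendation from this PR:

import lance

ds = lance.dataset("path/to/text2image-dataset")
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    metric="cosine",
    num_partitions=16384,  # IVF centroid count; large counts are where cuvs helps most
    num_sub_vectors=25,    # PQ sub-vectors (200 dims / 25 = 8 dims per sub-vector)
    accelerator="cuvs",    # previously "cuda"; requires a cuvs-py3X optional dependency
)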

jacketsj · Aug 31 '24 09:08