[BUG] Low recall with CAGRA when sparsity or dimensionality is high
**Describe the bug**
Low recall with CAGRA when sparsity or dimensionality (`n_cols`) is high.

**Steps/Code to reproduce bug**
Run `run_filtered_search_test` with `n_cols = 1024` and sparsity > 0.95.

**Expected behavior**
Recall should be > 0.7.
**Environment details (please complete the following information):**
- Environment location: Docker, Azure A100
- Method of RAFT install: conda, Docker
  - `docker pull rapidsai/base:25.04a-cuda12.8-py3.12`
**Additional context**
I created a 3-million-vector, 1024-dimension embedding set from text to build a RAG system, and I created a bitset to ensure that RAG queries return only relevant passages. However, recall started to degrade: there were cases where, without the bitset, the retrieved passages were relevant. I suspect this is due to the high dimensionality combined with high sparsity. Is there any way to work around this?

P.S. Recall is better in 25.04 than in 25.02, as far as I have tested.
Thank you for reporting the problem. Regarding recall decreasing as the number of dimensions (`n_cols`) increases: CAGRA has a search parameter called `itopk_size`. Could you try increasing it? The default value is 64; to set it to 256, for example, add `itopk_size=256` in the following section:
https://github.com/rapidsai/cuvs/blob/bd6d4a9934ad7f3f3cf2e2c6446abdbdf10ffba0/python/cuvs/cuvs/tests/ann_utils.py#L81
`itopk_size` is a parameter that roughly corresponds to the number of search iterations. In general, to reach the same level of recall, you need to increase the number of iterations as the number of dimensions increases.
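As a rough sketch of what that looks like outside the test harness (this assumes the `cuvs.neighbors.cagra` Python API and uses random data in place of your real embeddings; it needs a CUDA-capable GPU to run):

```python
# Sketch only: assumes the cuvs Python package and a CUDA GPU are available.
# The dataset below is random; substitute your actual embeddings.
import numpy as np
from cuvs.neighbors import cagra

n_rows, n_cols, k = 10_000, 1024, 10
dataset = np.random.random((n_rows, n_cols)).astype(np.float32)
queries = np.random.random((100, n_cols)).astype(np.float32)

# Build the CAGRA graph index with default build parameters.
index = cagra.build(cagra.IndexParams(), dataset)

# The default itopk_size is 64. Raising it (e.g. to 256) widens the
# internal candidate list, trading search time for higher recall,
# which matters more as dimensionality grows.
search_params = cagra.SearchParams(itopk_size=256)
distances, neighbors = cagra.search(search_params, index, queries, k)
```

The same `itopk_size=256` keyword can be passed wherever the test constructs its `SearchParams`.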