Reduce device memory usage for CAGRA's graph optimization process (2-hop detour counting)
CAGRA takes an initial knn graph as input and optimizes it into the final search graph. Graph optimization involves several processing steps, the most memory-intensive of which is the counting of 2-hop detours.
Currently, 2-hop detour counting is performed on the GPU for speed, which requires the entire initial knn graph to reside in device memory. The initial knn graph is typically 2x the size of the final search graph, so in the current implementation the search graph that can be created is effectively capped at roughly half the device memory size. Building search graphs for huge datasets therefore requires a GPU with a large amount of device memory, which is not practical.
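As background, here is a minimal host-side sketch of what 2-hop detour counting computes. This is illustrative only: the function name and the exact detour definition below are assumptions for exposition, not the cuVS code. For each edge (v, w), it counts the 2-hop paths v → u → w that also exist in the graph; edges with many detours are the first candidates for pruning.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch (not the CAGRA implementation): for every edge
// (v, w) in a knn graph, count how many 2-hop paths v -> u -> w exist.
// graph[v] lists the out-neighbors of v; counts[v][i] is the detour
// count of the edge from v to graph[v][i].
std::vector<std::vector<uint32_t>>
count_two_hop_detours(const std::vector<std::vector<uint32_t>>& graph) {
    const std::size_t n = graph.size();
    std::vector<std::vector<uint32_t>> counts(n);
    for (std::size_t v = 0; v < n; ++v) {
        counts[v].assign(graph[v].size(), 0);
        for (uint32_t u : graph[v]) {       // first hop:  v -> u
            for (uint32_t w : graph[u]) {   // second hop: u -> w
                // If (v, w) is also a direct edge, v -> u -> w is a detour.
                for (std::size_t i = 0; i < graph[v].size(); ++i) {
                    if (graph[v][i] == w) ++counts[v][i];
                }
            }
        }
    }
    return counts;
}
```

Note that the inner loops walk the neighbor lists of arbitrary nodes `u`, which is why the GPU kernel needs the whole graph resident in device memory.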
To address this issue, this PR adds a CPU implementation of 2-hop detour counting and uses this CPU implementation to count 2-hop detours when device memory is insufficient.
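The fallback described above might be dispatched along these lines. This is a hedged sketch: `choose_backend`, the 2x safety-margin factor, and the byte accounting are illustrative assumptions, not the actual cuVS logic.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative dispatch sketch (names and margin are assumptions):
// fall back to the host implementation when the initial knn graph
// does not fit in the free device memory.
enum class DetourCountBackend { Device, Host };

DetourCountBackend choose_backend(std::size_t graph_rows,
                                  std::size_t graph_degree,
                                  std::size_t free_device_bytes) {
    // Each edge is stored as a 4-byte node index; leave headroom for
    // the working buffers the GPU kernel also needs (factor assumed).
    std::size_t graph_bytes  = graph_rows * graph_degree * sizeof(uint32_t);
    std::size_t needed_bytes = graph_bytes * 2;  // assumed safety margin
    return needed_bytes <= free_device_bytes ? DetourCountBackend::Device
                                             : DetourCountBackend::Host;
}
```

In a real build the free-memory figure would come from the device runtime rather than being passed in by the caller.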
The CPU implementation is thread-parallel and optimized to minimize conditional branches. It is of course slower than the GPU implementation, but only by a factor of roughly 3-4x. Since 2-hop detour counting on the GPU accounts for approximately 10% of total indexing time, using the CPU implementation increases overall indexing time by about 20-30%, which is well within the practical range.