Remove the duplicates when the dataset contains identical vectors
This PR addresses the problem of duplicate search results that occur when multiple identical vectors exist in a vector dataset.
A relatively straightforward approach is to fetch more search results than needed and remove the duplicates afterwards. The problem is that you cannot know in advance how many duplicates the results will contain, so in some cases the search has to be repeated with a larger result count.
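For illustration only, the over-fetch approach might look like the sketch below. It is not code from this PR: `brute_force_search` is a hypothetical stand-in for the real index search, exact byte equality is assumed as the definition of "identical", and doubling the fetch size on each retry is an arbitrary choice.

```python
import numpy as np


def brute_force_search(dataset, query, k):
    """Hypothetical stand-in for an index search: ids of the k nearest rows (L2)."""
    dists = np.linalg.norm(dataset - query, axis=1)
    return np.argsort(dists, kind="stable")[:k]


def search_without_duplicates(dataset, query, k):
    """Over-fetch, drop identical vectors, and retry until k unique hits are found.

    Returns (ids, rounds), where rounds counts how many searches were needed.
    """
    n = len(dataset)
    fetch = k
    rounds = 0
    while True:
        rounds += 1
        ids = brute_force_search(dataset, query, min(fetch, n))
        seen = set()
        unique = []
        for i in ids:
            key = dataset[i].tobytes()  # assumption: duplicates are byte-identical
            if key not in seen:
                seen.add(key)
                unique.append(int(i))
        # Stop when enough unique results exist or the whole dataset was fetched.
        if len(unique) >= k or fetch >= n:
            return unique[:k], rounds
        fetch *= 2  # arbitrary growth factor for the next attempt
```

The `rounds` value makes the cost visible: whenever duplicates crowd the first result set, the search runs again with a larger result count.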
The method implemented in this PR instead finds duplicates while the graph index is being created and disables the duplicate nodes. Creating the graph index takes slightly longer, but in return search performance is not degraded, unlike with the above method.
You may wonder how duplicate nodes can be disabled without degrading search performance. The answer is to isolate them. Every outgoing edge of a node identified as a duplicate is replaced with an edge pointing back to the node itself, and the graph is built so that no other node has an edge pointing to it. In a graph built this way, duplicate nodes are not traversed during the search and therefore never appear in the search results.
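The isolation idea can be sketched on a toy graph as follows. This is a simplified illustration, not the implementation in this PR: it builds an exact k-nearest-neighbor graph with NumPy, treats byte-identical rows as duplicates, keeps the first occurrence as the active node, and assumes there are more than `degree` unique vectors. The names `find_duplicates`, `build_graph`, and `reachable` are hypothetical.

```python
import numpy as np


def find_duplicates(dataset):
    """Mark every row that is byte-identical to an earlier row."""
    first_seen = set()
    is_dup = np.zeros(len(dataset), dtype=bool)
    for i, row in enumerate(dataset):
        key = row.tobytes()
        if key in first_seen:
            is_dup[i] = True
        else:
            first_seen.add(key)
    return is_dup


def build_graph(dataset, degree):
    """Fixed-degree kNN graph in which duplicate nodes are isolated."""
    is_dup = find_duplicates(dataset)
    keep = np.flatnonzero(~is_dup)  # only these may be edge targets
    graph = np.empty((len(dataset), degree), dtype=np.int64)
    for i in range(len(dataset)):
        if is_dup[i]:
            graph[i, :] = i  # every outgoing edge loops back to the node itself
            continue
        candidates = keep[keep != i]
        dists = np.linalg.norm(dataset[candidates] - dataset[i], axis=1)
        graph[i] = candidates[np.argsort(dists, kind="stable")[:degree]]
    return graph, is_dup


def reachable(graph, entry):
    """All nodes a traversal can visit when it starts at `entry`."""
    seen = {entry}
    stack = [entry]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            nxt = int(nxt)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

In this sketch, a traversal that starts from a non-duplicate node can never reach a duplicate, because no edge points to one. The search loop itself needs no extra filtering step, which is why the cost is paid at graph creation rather than at query time.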
This feature is disabled by default. We have added a deactivate_duplicate_nodes parameter for graph creation; setting it to true creates a graph with duplicate nodes disabled.
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
I would like to change the base branch to 25.06. What should I do?
@anaruse it looks like we have a couple conflicts here. Do you think we can get this change into 25.08?