tinyknn Better support for storing points in multiple lists

Since 2df6a428cf6bcc4e4a08f15e3f7caef9ce5f4f61 it is possible to store every datapoint in n lists by building with ivf.build(n_probes=n). This increases performance recall/qps quite a lot, but only when going from n=1 to n=2, as seen in the attached figure. Figure_1

The problem is probably that duplicate matches aren't handled well. When calling ctop from ivf, we should somehow tell it about the indices we have already collected so the distance table can focus on telling us about alternative interesting candidates.

One option is to even reuse the query_pq_sse(transformed_data, self.tables, indices, values, True) call by calling it on multiple (transformed_data, tables) pairs while keeping (indices, values) fixed. That way we also would only do rescoring/pass-2 a single time on all candidates retrieved from different lists.

The issue is that (1) the binary heap data structure we use can't recognize duplicates, and (2) the query_pq_sse function only knows the "local" id of a point in a list, not the global id.

To solve (2) we could pass a list with the global ids of all the points considered. This would be some extra overhead for query_pq_sse to pass around, but perhaps not that much. And we wouldn't have to "relabel" the returned ids afterwards.

For (1) we could switch back to using insertion sort, or just try heuristically to remove some of the duplicates the heap is able to find.

Apr 11 '23 18:04 thomasahle

Improved a lot now after adding a new query function. However, there's still an issue where later query_sse calls don't know which ids have already been found, and then as we do more and more build probes, the duplicate points start clouding out other useful points, actually reducing our recall.

plot

Apr 12 '23 02:04 thomasahle

We now use the same priority queue for all probes and check for duplicate ids. While this has made things faster, it unfortunately hasn't change the fast that we still don't get much of a benefit for storing points in multiple lists. plot

Apr 19 '23 00:04 thomasahle

Maybe the issue is that while the recall goes up with more build_probes, the size of the clusters also increase, which makes the query/cluster slower. Currently we don't change the number of clusters as we increase build_probes, but we probably should make it something like sqrt(n * build_probes).

Apr 19 '23 00:04 thomasahle

There are advantages to query_probes that do not exist for build_probes.

As one increases n_clusters (ie reduces the size of the clusters) and increases query_probes, one converges towards searching through a ball centered at the query. Thus, query_probes improves the query-adaptivity of the search process. Searching many small clusters gives better recall than searching a few large clusters: in the latter case, retrieval fails whenever the query and the true nearest neighbor are close yet on opposite sides of a boundary between clusters.

Increasing build_probes does not yield the same query-adaptivity. Instead, the IVF clusters are essentially covering overlapping regions. But there's no way to make a cluster region expand in the direction of a query-point, without also expanding in the opposite direction, because cluster regions are expanded at build-time, not query-time.

Jul 04 '23 16:07 calvinmccarter