GIST Ground Truth Data Missing

Open wahajali opened this issue 2 years ago • 4 comments

I want to run the Search Performance Test on the GIST dataset. I created a new test, since current workloads don't have GIST as part of the performance test. Currently GIST and SIFT are only used in capacity test.

However, the dataset doesn't contain the ground truth data. It only downloads train.parquet and doesn't download the ground truth data (I believe that would be neighbors.parquet).

wahajali avatar Mar 19 '24 00:03 wahajali

Right. We are considering opening up more datasets in the next release, as well as supporting users with their own local datasets.

Currently GIST and SIFT are only used in capacity test. the dataset doesn't contain the ground truth data. I believe that would be neighbors.parquet

alwayslove2013 avatar Mar 20 '24 10:03 alwayslove2013

@alwayslove2013 I also need this! Any update here? Or is there any way to generate neighbors.parquet from the original GIST and SIFT ground truth files? Thanks.
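As a possible starting point for that conversion: the original SIFT and GIST ground truth is commonly distributed as `.ivecs` files, where each record is a little-endian int32 count followed by that many int32 neighbor IDs. The sketch below only decodes that layout from raw bytes. The function name `parse_ivecs` is made up for illustration, and the column layout VectorDBBench expects in neighbors.parquet is not stated in this thread, so writing the parquet file is left out.

```python
import struct
from typing import List


def parse_ivecs(buf: bytes) -> List[List[int]]:
    """Decode the raw bytes of an .ivecs file into one ID list per query.

    Assumed layout: for each record, a little-endian int32 length d,
    followed by d little-endian int32 values.
    """
    rows: List[List[int]] = []
    offset = 0
    while offset < len(buf):
        # Read the per-record length header.
        (dim,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        # Read the dim neighbor IDs that follow the header.
        row = list(struct.unpack_from(f"<{dim}i", buf, offset))
        offset += 4 * dim
        rows.append(row)
    return rows
```

Reading a real file would then be `parse_ivecs(open(path, "rb").read())`, with `path` pointing at the downloaded ground truth file.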

xinhuitian avatar Jul 11 '24 08:07 xinhuitian

@alwayslove2013 I wanted to ask how we can generate the ground truth data. I am using pgvector, and my understanding is that if I remove the index and query the data, I should get the GT data. To verify this, I tested it on the OpenAI 500K dataset (cosine distance) and found a few mismatches between the GT data that I calculated and the data provided by VectorDBBench. The difference is only in the order; the set of returned vectors is the same. Usually two elements are just swapped.
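For reference, an exact (brute-force) top-k search under cosine distance can be sketched in plain Python as below. This is not how VectorDBBench or pgvector compute it; the function names are illustrative, vectors are assumed to be non-zero, and ties are broken by the lower row index, which is only one of many valid conventions.

```python
import heapq
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_neighbors(
    train: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[int]:
    """Return the indices of the k nearest train vectors to query.

    Every train vector is scored, so the result is exact. Sorting on
    (distance, index) makes the order of tied distances deterministic.
    """
    scored = ((cosine_distance(v, query), i) for i, v in enumerate(train))
    return [i for _, i in heapq.nsmallest(k, scored)]
```

This is only practical for small samples; for the full dataset a vectorized or database-side exact scan would be needed.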

wahajali avatar Oct 07 '24 15:10 wahajali

This happens when there are ties in the ground truth. There is no guarantee that any specific engine will return tied results in a particular order, or even in the same order consistently.
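One way to keep such swaps from showing up as mismatches is to compare results as sets rather than as ordered lists. The helpers below are a minimal sketch with made-up names. Note that a tie exactly at the k-th position can still change which ID makes the cut, and a set comparison does not cover that case.

```python
from typing import Sequence


def recall_at_k(result: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the top-k ground truth IDs found in the top-k result, ignoring order."""
    truth = set(ground_truth[:k])
    return len(truth.intersection(result[:k])) / k


def differs_only_in_order(a: Sequence[int], b: Sequence[int]) -> bool:
    """True when a and b hold the same IDs but in a different order."""
    return list(a) != list(b) and sorted(a) == sorted(b)
```

With these, two result lists where neighbors are merely swapped give a recall of 1.0 and are flagged as differing only in order.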

greenhal avatar Oct 07 '24 17:10 greenhal