
SPLADE representations on BEIR dataset

CosimoRulli opened this issue 2 years ago · 1 comment

Hi, thank you for sharing and maintaining this repo! I would like to generate SPLADE representations for both documents and queries for all the datasets in BEIR, similar to what the create_anserini script makes possible for the MSMARCO dataset. I would like to do this for both splade-cocondenser-ensembledistil and efficient-splade-V-large.

I tried running the following script:

export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"

for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    python3 -m splade.beir_eval \
        config.pretrained_no_yamlconfig=true \
        +beir.dataset=$dataset \
        +beir.dataset_path=data/beir \
        config.index_retrieve_batch_size=100
done

but I get NDCG=0.001 on the arguana dataset (at which point I stopped the script, since I guessed something was wrong). What am I doing wrong? Also, does this script save the embeddings of each dataset? If not, how can I force it to save them?

CosimoRulli · Jan 04 '24

Hi @CosimoRulli

Sorry for the late reply! I think the issue is that the model checkpoint is not being loaded correctly. As described in the README, if you only want to evaluate the model from an existing checkpoint, you should add the init_dict.model_type_or_dir line and run:

python3 -m splade.beir_eval \
    init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
    config.pretrained_no_yamlconfig=true \
    +beir.dataset=$dataset \
    +beir.dataset_path=data/beir \
    config.index_retrieve_batch_size=100
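
For reference, your full loop with that line added would look something like the sketch below (I have not re-run it myself, so treat it as a starting point; it keeps the same SPLADE_CONFIG_NAME export from your snippet):

export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"

for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    # same command as above, with the checkpoint passed via init_dict
    python3 -m splade.beir_eval \
        init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
        config.pretrained_no_yamlconfig=true \
        +beir.dataset=$dataset \
        +beir.dataset_path=data/beir \
        config.index_retrieve_batch_size=100
done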

Let me know if that works! Best

thibault-formal · Jan 29 '24