Some questions about the 'Pre-processing' part

Open ImmortalCi opened this issue 2 years ago • 1 comments

Hello, thanks for your interesting work!

I'm trying to reproduce the COCO pre-training, and I noticed that I need to preprocess the dataset first. This is mentioned in ./COCO-DR/COCO/README.md.

But when I follow the instructions in it, something goes wrong in pre_processing_coco.sh. It calls COCO-DR/COCO/helper/create_train_co_short.py, which has a function called encode_one().

On lines 35 and 36, item is a dict, but it has no 'group' or 'spans' key. This raises KeyError: 'group'.

log as follows:

I noticed that there are only four keys in each line of the dataset: 'id', 'title', 'text', 'metadata'. Did I miss some steps before preprocessing? I'm eagerly looking forward to your reply! Thanks a lot!
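For anyone hitting the same error, a quick check like the one below can confirm which expected keys are absent from a record. This is only a sketch: it assumes the dataset is stored as one JSON object per line, and `missing_keys` and the sample record are hypothetical names made up for illustration, not part of the COCO-DR code.

```python
import json

def missing_keys(line, required=("group", "spans")):
    """Return the required keys that are absent from one JSON-lines record."""
    item = json.loads(line)
    return [key for key in required if key not in item]

# Hypothetical record carrying only the four keys seen in the dataset.
sample = json.dumps({"id": "0", "title": "t", "text": "x", "metadata": {}})
print(missing_keys(sample))  # prints ['group', 'spans']
```

Running the same function over the first few lines of the real file should show whether every record lacks these keys or only some of them.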

Best regards!

Oct 08 '23 11:10 ImmortalCi

Hi,

Thank you for your interest. We previously used "group" to encode different BEIR dataset names. However, it is not utilized in COCO pretraining. We will address this issue promptly.

Best, Yue

Oct 10 '23 21:10 yueyu1030
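
Until a fix lands, one possible local workaround is to read the optional fields with `dict.get` instead of direct indexing, so that a missing key no longer raises `KeyError`. The sketch below is not the maintainers' fix and does not reproduce the real body of encode_one(), which is not shown in this thread. The helper name `read_optional_fields` and both default values are assumptions; in particular, the reply above only says that "group" is unused in COCO pretraining, so whether an empty default for "spans" is safe depends on how the script uses it.

```python
def read_optional_fields(item, default_group=""):
    """Read 'group' and 'spans' from a record without raising KeyError.

    Hypothetical helper: the defaults are assumptions, not values
    taken from the COCO-DR repository.
    """
    group = item.get("group", default_group)
    # Assumption: an empty list stands in for absent span annotations.
    spans = item.get("spans", [])
    return group, spans

# A record with only the four keys reported in the issue.
record = {"id": "0", "title": "t", "text": "x", "metadata": {}}
group, spans = read_optional_fields(record)
print(repr(group), spans)  # prints '' []
```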