Some questions about the 'Pre-processing' part

Open ImmortalCi opened this issue 2 years ago • 1 comments

Hello, thanks for your interesting work!

I'm trying to reproduce COCO pre-training, and I noticed that I need to preprocess the dataset first. This is mentioned in ./COCO-DR/COCO/README.md.

But when I follow the instructions there, something goes wrong in pre_processing_coco.sh. It calls COCO-DR/COCO/helper/create_train_co_short.py, which has a function called encode_one().

On lines 35 and 36, item is a dict, but it has no 'group' or 'spans' key. This raises KeyError: 'group'.

The log is as follows: (screenshot)

I noticed that there are only four keys in each line of the dataset: 'id', 'title', 'text', 'metadata'. Did I miss some steps before preprocessing? I'm eagerly looking forward to your reply. Thanks a lot!
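For anyone who wants to confirm the same thing on their own copy of the data, here is a minimal sketch that counts the top-level keys across JSON Lines records. It assumes the dataset is stored one JSON object per line; the helper name `count_keys` and the two sample records are made up for illustration and are not part of the COCO-DR code.

```python
import json
from collections import Counter


def count_keys(lines):
    """Count how often each top-level key appears across JSON Lines records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts.update(json.loads(line).keys())
    return counts


# Illustrative records with the four keys reported above (not real data).
sample = [
    '{"id": "0", "title": "t", "text": "x", "metadata": {}}',
    '{"id": "1", "title": "u", "text": "y", "metadata": {}}',
]

key_counts = count_keys(sample)
# Keys the script is reported to look up but that never appear in the records.
missing = {"group", "spans"} - set(key_counts)

print(sorted(key_counts))
print(sorted(missing))
```

To run it on a real file, pass an open file handle instead of `sample`, for example `count_keys(open(path, encoding="utf-8"))`, where `path` points at your own dataset file.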

Best regards!

ImmortalCi, Oct 08 '23 11:10

Hi,

Thank you for your interest. We previously used "group" to encode different BEIR dataset names. However, it is not utilized in COCO pretraining. We will address this issue promptly.
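Until a fix lands, one possible local workaround is to treat the field as optional when reading each record. The sketch below is only an assumption about how that could look, not the actual patch: `read_optional_fields` and the `"unused"` placeholder are hypothetical names, and since this thread does not say how 'spans' should be handled, the sketch simply returns `None` for it when it is absent.

```python
def read_optional_fields(item, default_group="unused"):
    """Read 'group' and 'spans' from a record without raising KeyError.

    Hypothetical helper: 'default_group' is a placeholder value, and the
    handling of a missing 'spans' key is an assumption, not the authors' fix.
    """
    group = item.get("group", default_group)  # 'group' is not used in COCO pretraining
    spans = item.get("spans")  # None when the key is absent
    return group, spans


# A record shaped like the ones described in this issue (illustrative values).
record = {"id": "0", "title": "t", "text": "x", "metadata": {}}
result = read_optional_fields(record)
print(result)
```

Whether downstream code can cope with a missing 'spans' value depends on the rest of encode_one(), which is not shown here.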

Best, Yue

yueyu1030, Oct 10 '23 21:10