VinVL captioning: sizes of od_labels and features do not match
I downloaded the VinVL captioning data from https://github.com/microsoft/Oscar/blob/master/VinVL_DOWNLOAD.md.
When I run captioning training, I find that the size of the features (https://github.com/microsoft/Oscar/blob/master/oscar/run_captioning.py#L138) and the size of label_info (https://github.com/microsoft/Oscar/blob/master/oscar/run_captioning.py#L124) don't match. In principle, shouldn't both be equal to the number of detected objects?
I tried the old Oscar features, and those match.
Hi,
The region features do not need to have strict 1-to-1 correspondence with the object labels. In fact, it is acceptable to use different confidence thresholds, or even different models, to extract region features and labels. Btw, if you look at the maximum lengths of the image input and object label input, they are also different.
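To make the point above concrete, here is a minimal sketch of how two inputs with different lengths can each be truncated or padded to their own maximum length. It is not the actual Oscar code: the function names, the `[PAD]` token, and all sizes are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def pad_features(feats: List[List[float]], max_len: int, dim: int) -> List[List[float]]:
    """Truncate to max_len regions, then pad with zero vectors (hypothetical helper)."""
    kept = [list(row) for row in feats[:max_len]]
    # Build a fresh zero vector for every padded slot so rows are not shared.
    kept += [[0.0] * dim for _ in range(max_len - len(kept))]
    return kept


def pad_labels(labels: List[str], max_len: int, pad: str = "[PAD]") -> List[str]:
    """Truncate to max_len labels, then pad with a placeholder token (hypothetical helper)."""
    kept = list(labels[:max_len])
    return kept + [pad] * (max_len - len(kept))


def build_inputs(
    feats: List[List[float]],
    labels: List[str],
    max_img_len: int,
    max_label_len: int,
    dim: int,
) -> Tuple[List[List[float]], List[str]]:
    """Process the two inputs independently; their original lengths never need to agree."""
    return pad_features(feats, max_img_len, dim), pad_labels(labels, max_label_len)


# Assumed toy data: 5 region features but only 3 object labels,
# e.g. because a different confidence threshold was used for each.
feats = [[float(i)] * 4 for i in range(5)]
labels = ["dog", "frisbee", "grass"]

img_input, label_input = build_inputs(feats, labels, max_img_len=6, max_label_len=4, dim=4)
print(len(img_input), len(label_input))
```

Because each input is handled on its own, a mismatch between the number of region features and the number of labels does not cause a problem in a pipeline of this shape.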
Do you know which model is used to obtain the label_info of the COCO_Caption dataset?