Metrics of ClipCap's Original Performance
Hello, thank you very much for your work.
In my experiments, I used the transformer mapping network with the default settings, but I failed to reach the metrics reported in the paper.
In more detail, I used K=10 constant tokens, a prefix length of 10, and 8 multi-head self-attention layers with 8 heads each, trained for 10 epochs with a batch size of 40 and AdamW as the optimizer. The learning rate and warm-up steps are the defaults (2e-5 and 5000, respectively). The image encoder and the decoder are also the defaults (ViT-B/32 and GPT2).
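For clarity, here is a minimal sketch of the optimizer/scheduler setup I described (the hyperparameter names are my own shorthand, not the repository's exact flags, and the model construction is stubbed out):

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Hyperparameters as described above (names are illustrative shorthand)
config = {
    "prefix_length": 10,   # K = 10 constant tokens
    "num_layers": 8,       # multi-head self-attention layers in the mapper
    "num_heads": 8,        # attention heads per layer
    "epochs": 10,
    "batch_size": 40,
    "lr": 2e-5,            # default learning rate
    "warmup_steps": 5000,  # default warm-up steps
}

model = torch.nn.Linear(512, 512)  # stand-in for the ClipCap transformer mapper
steps_per_epoch = 10000            # replace with len(train_dataloader)

optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=config["warmup_steps"],
    num_training_steps=config["epochs"] * steps_per_epoch,
)
```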
The metrics reported in the paper (for the COCO dataset with the transformer mapping network) are (B4: 33.53%, METEOR: 27.45%, CIDEr: 113%), whereas mine are (B4: 71.72%, METEOR: 24.89%, CIDEr: 90.91%), which fall significantly short of the original.
Lastly, I should mention that the above experiment was trained on a single GPU and validated on the COCO validation set. The evaluation metrics are computed with the pycocoevalcap repository.
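For reference, my evaluation follows the standard pycocoevalcap loop, roughly like this (the file paths are placeholders):

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder paths: COCO caption annotations and the model's predictions in the
# standard results format: [{"image_id": <int>, "caption": <str>}, ...]
annotation_file = "captions_val2014.json"
results_file = "clipcap_predictions.json"

coco = COCO(annotation_file)
coco_result = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_result)
# Evaluate only on the images that predictions were generated for
coco_eval.params["image_id"] = coco_result.getImgIds()
coco_eval.evaluate()

# Scores come out on a 0-1 scale (CIDEr up to ~10); the paper reports them x100
for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score * 100:.2f}")
```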
Any ideas on how to reach the original model's performance?
There may be something wrong with your metrics, because a B4 of 71.72% is far too high.
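One possible cause (an assumption on my part, since I haven't seen your evaluation code): pycocoevalcap's Bleu(4) scorer returns all four n-gram scores as a list, so reading the first element reports BLEU-1, which for strong COCO captioning models typically lands in the low 70s, right around the 71.72% you got. A minimal sketch:

```python
from pycocoevalcap.bleu.bleu import Bleu

# Toy example: both dicts map image_id -> list of caption strings
gts = {0: ["a man riding a horse on a beach"]}    # reference captions
res = {0: ["a man riding a horse on the beach"]}  # generated caption (exactly one per image)

score, _ = Bleu(4).compute_score(gts, res)
# score is a list: [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
bleu1, bleu4 = score[0], score[3]
print(f"BLEU-1: {bleu1:.4f}, BLEU-4: {bleu4:.4f}")  # score[0] is NOT BLEU-4
```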
Did you solve this problem?