mt-captioning
This repository corresponds to the PyTorch implementation of the paper Multimodal Transformer with Multi-View Visual Representation for Image Captioning. Using the bottom-up-attention visual features (with slight improvements), our single-view Multimodal Transformer model (MT_sv) delivers 130.9 CIDEr on the Karpathy test split of the MSCOCO dataset. Please check our paper for details.
Table of Contents
- Prerequisites
- Training
- Testing
- Pre-trained models
- Citation
- Acknowledgement
Prerequisites
Requirements
The annotation files can be downloaded here and unzipped to the datasets folder.
The visual features are extracted by our bottom-up-attention.pytorch repo using the following scripts:
# 1. Extract the bboxes from the images
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml \
--image-dir <image_dir> --out-dir <bbox_dir> --resume
# 2. Extract the RoI features using the extracted bboxes
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml \
--image-dir <image_dir> --gt-bbox-dir <bbox_dir> --out-dir <output_dir> --resume
We provide pre-extracted features in the datasets/mscoco/features/val2014 folder for the images in datasets/mscoco/image to help validate the correctness of your extracted features.
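To make that check concrete, here is a minimal sketch that diffs freshly extracted .npz files against the provided reference copies. It is illustrative only: the <output_dir> placeholder and the tolerance are assumptions, and the per-image key names depend on the extraction config (inspect ref.files to see what extract_features.py actually stored).

import glob
import os
import numpy as np

REF_DIR = "datasets/mscoco/features/val2014"
NEW_DIR = "<output_dir>"  # wherever your extraction run wrote its .npz files

for ref_path in glob.glob(os.path.join(REF_DIR, "*.npz")):
    name = os.path.basename(ref_path)
    ref = np.load(ref_path)
    new = np.load(os.path.join(NEW_DIR, name))
    # ref.files lists the arrays stored per image (e.g. features and boxes).
    for key in ref.files:
        ok = ref[key].shape == new[key].shape and np.allclose(ref[key], new[key], atol=1e-4)
        print(f"{name} [{key}]: {'ok' if ok else 'MISMATCH'}")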
We use ResNet-101 as the backbone and extract features for the whole MSCOCO dataset into the datasets/mscoco/features/frcn-r101 folder.
Finally, the datasets folder will have the following structure:
|-- datasets
|-- mscoco
| |-- features
| | |-- frcn-r101
| | | |-- train2014
| | | | |-- COCO_train2014_....jpg.npz
| | | |-- val2014
| | | | |-- COCO_val2014_....jpg.npz
| | | |-- test2015
| | | | |-- COCO_test2015_....jpg.npz
| |-- annotations
| | |-- coco-train-idxs.p
| | |-- coco-train-words.p
| | |-- cocotalk_label.h5
| | |-- cocotalk.json
| | |-- vocab.json
| | |-- glove_embeding.npy
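Before training, a quick optional check (a small sketch, with paths taken verbatim from the tree above) can confirm that everything landed in the right place:

import os

expected = [
    "datasets/mscoco/features/frcn-r101/train2014",
    "datasets/mscoco/features/frcn-r101/val2014",
    "datasets/mscoco/features/frcn-r101/test2015",
    "datasets/mscoco/annotations/coco-train-idxs.p",
    "datasets/mscoco/annotations/coco-train-words.p",
    "datasets/mscoco/annotations/cocotalk_label.h5",
    "datasets/mscoco/annotations/cocotalk.json",
    "datasets/mscoco/annotations/vocab.json",
    "datasets/mscoco/annotations/glove_embeding.npy",
]
for path in expected:
    print(("ok      " if os.path.exists(path) else "MISSING ") + path)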
Training
The following script will train a model with cross-entropy loss:
$ python train.py --caption_model svbase --ckpt_path <checkpoint_dir> --gpu_id 0
- caption_model refers to the model to be trained, such as svbase or umv.
- ckpt_path refers to the directory where checkpoints are saved.
- gpu_id refers to the GPU id.
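For orientation, the cross-entropy stage is standard teacher-forced next-token prediction. The sketch below shows that loss in isolation; it is an assumption-level illustration, not the repo's actual training code, and the tensor shapes and padding convention are ours.

import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_idx=0):
    """Teacher-forced cross-entropy over caption tokens.

    logits:  (batch, seq_len, vocab_size) scores from the captioning model
    targets: (batch, seq_len) ground-truth token ids, padded with pad_idx
    """
    # Flatten batch and time dims; padding positions are excluded from the loss.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,
    )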
Based on the model trained with cross-entropy loss, the following script will load that pre-trained model and fine-tune it with self-critical loss:
$ python train.py --caption_model svbase --learning_rate 1e-5 --ckpt_path <checkpoint_dir_rl> --start_from <checkpoint_dir> --gpu_id 0 --max_epochs 25
- caption_model refers to the model to be trained.
- learning_rate refers to the learning rate used in self-critical training.
- ckpt_path refers to the directory where the fine-tuned checkpoints are saved.
- start_from refers to the directory of the cross-entropy checkpoint to initialize from.
- gpu_id refers to the GPU id.
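Self-critical fine-tuning follows the SCST recipe: sampled captions are scored with CIDEr, and the greedily decoded caption's score serves as the baseline, so only samples that beat greedy decoding are reinforced. A minimal sketch of that loss under assumed tensor shapes (again, not the repo's exact implementation):

import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    """REINFORCE with a self-critical (greedy-decoding) baseline.

    sample_logprobs: (batch, seq_len) log-probs of the sampled caption tokens
    sample_reward:   (batch,) e.g. CIDEr scores of the sampled captions
    greedy_reward:   (batch,) e.g. CIDEr scores of the greedy captions
    mask:            (batch, seq_len) 1 for real tokens, 0 for padding
    """
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (batch, 1)
    # Maximizing expected reward == minimizing -advantage-weighted log-prob.
    return -(advantage * sample_logprobs * mask).sum() / mask.sum()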
Testing
Given the trained model, the following script will report the performance on the val split of MSCOCO:
$ python test.py --ckpt_path <checkpoint_dir> --gpu_id 0
- ckpt_path refers to the directory containing the trained checkpoint.
- gpu_id refers to the GPU id.
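The reported metrics come from coco-caption-style scorers (see the Acknowledgement section). As a rough, self-contained sketch, CIDEr can also be computed directly through the pycocoevalcap API, assuming that package is installed; the image ids and captions below are made-up examples.

from pycocoevalcap.cider.cider import Cider

# Captions are assumed to be pre-tokenized, lower-cased strings.
gts = {
    "img1": ["a man riding a bike down a dirt road"],  # reference captions
    "img2": ["two dogs playing in the snow"],
}
res = {
    "img1": ["a man rides a bicycle on a path"],       # model outputs
    "img2": ["dogs play in the snow"],
}

score, per_image = Cider().compute_score(gts, res)
print(f"CIDEr: {score:.3f}")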
Pre-trained models
At present we provide the pre-trained single-view MT model (MT_sv). More models will be added in the future.
| Model | Backbone | BLEU@1 | METEOR | CIDEr | Download |
|---|---|---|---|---|---|
| MT_sv | ResNet-101 | 80.8 | 29.1 | 130.9 | model |
Citation
If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:
@article{yu2019multimodal,
title={Multimodal transformer with multi-view visual representation for image captioning},
author={Yu, Jun and Li, Jing and Yu, Zhou and Huang, Qingming},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
year={2019},
publisher={IEEE}
}
Acknowledgement
We thank Ruotian Luo for his self-critical.pytorch, cider and coco-caption repos.