mt-captioning
This repository corresponds to the PyTorch implementation of the paper Multimodal Transformer with Multi-View Visual Representation for Image Captioning. Using the bottom-up-attention visual features (with slight improvements), our single-view Multimodal Transformer model (MT_sv) delivers 130.9 CIDEr on the Karpathy test split of the MSCOCO dataset. Please check our paper for details.
Table of Contents
- Prerequisites
- Training
- Testing
- Pre-trained models
- Citation
- Acknowledgement
Prerequisites
Requirements
The annotation files can be downloaded here and unzipped to the datasets folder.
The visual features are extracted by our bottom-up-attention.pytorch repo using the following scripts:
# 1. Extract the bboxes from the images
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml \
--image-dir <image_dir> --out-dir <bbox_dir> --resume
# 2. Extract the RoI features using the extracted bboxes
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml \
--image-dir <image_dir> --gt-bbox-dir <bbox_dir> --out-dir <output_dir> --resume
We provide pre-extracted features in the datasets/mscoco/features/val2014 folder for the images in datasets/mscoco/image to help validate the correctness of your extracted features.
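To make that check concrete, here is a minimal sketch that diffs freshly extracted .npz files against the provided reference copies. It is illustrative only: the <output_dir> placeholder and the tolerance are assumptions, and the per-image key names depend on the extraction config (inspect ref.files to see what extract_features.py actually stored).

import glob
import os
import numpy as np

REF_DIR = "datasets/mscoco/features/val2014"
NEW_DIR = "<output_dir>"  # wherever your extraction run wrote its .npz files

for ref_path in glob.glob(os.path.join(REF_DIR, "*.npz")):
    name = os.path.basename(ref_path)
    ref = np.load(ref_path)
    new = np.load(os.path.join(NEW_DIR, name))
    # ref.files lists the arrays stored per image (e.g. features and boxes).
    for key in ref.files:
        ok = ref[key].shape == new[key].shape and np.allclose(ref[key], new[key], atol=1e-4)
        print(f"{name} [{key}]: {'ok' if ok else 'MISMATCH'}")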
We use ResNet-101 as the backbone and extract features for the whole MSCOCO dataset into the datasets/mscoco/features/frcn-r101 folder.
Finally, the datasets folder will have the following structure:
|-- datasets
|-- mscoco
| |-- features
| | |-- frcn-r101
| | | |-- train2014
| | | | |-- COCO_train2014_....jpg.npz
| | | |-- val2014
| | | | |-- COCO_val2014_....jpg.npz
| | | |-- test2015
| | | | |-- COCO_test2015_....jpg.npz
| |-- annotations
| | |-- coco-train-idxs.p
| | |-- coco-train-words.p
| | |-- cocotalk_label.h5
| | |-- cocotalk.json
| | |-- vocab.json
| | |-- glove_embeding.npy
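Before training, a quick optional check (a small sketch, with paths taken verbatim from the tree above) can confirm that everything landed in the right place:

import os

expected = [
    "datasets/mscoco/features/frcn-r101/train2014",
    "datasets/mscoco/features/frcn-r101/val2014",
    "datasets/mscoco/features/frcn-r101/test2015",
    "datasets/mscoco/annotations/coco-train-idxs.p",
    "datasets/mscoco/annotations/coco-train-words.p",
    "datasets/mscoco/annotations/cocotalk_label.h5",
    "datasets/mscoco/annotations/cocotalk.json",
    "datasets/mscoco/annotations/vocab.json",
    "datasets/mscoco/annotations/glove_embeding.npy",
]
for path in expected:
    print(("ok      " if os.path.exists(path) else "MISSING ") + path)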
Training
The following script will train a model with cross-entropy loss:
$ python train.py --caption_model svbase --ckpt_path <checkpoint_dir> --gpu_id 0
- caption_model refers to the model to be trained, such as svbase or umv.
- ckpt_path refers to the directory where checkpoints are saved.
- gpu_id refers to the GPU id.
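For orientation, the cross-entropy stage is standard teacher-forced next-token prediction. The sketch below shows that loss in isolation; it is an assumption-level illustration, not the repo's actual training code, and the tensor shapes and padding convention are ours.

import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_idx=0):
    """Teacher-forced cross-entropy over caption tokens.

    logits:  (batch, seq_len, vocab_size) scores from the captioning model
    targets: (batch, seq_len) ground-truth token ids, padded with pad_idx
    """
    # Flatten batch and time dims; padding positions are excluded from the loss.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,
    )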
Based on the model trained with cross-entropy loss, the following script will load that pre-trained model and fine-tune it with self-critical loss:
$ python train.py --caption_model svbase --learning_rate 1e-5 --ckpt_path <checkpoint_dir_rl> --start_from <checkpoint_dir> --gpu_id 0 --max_epochs 25
- caption_model refers to the model to be trained.
- learning_rate refers to the learning rate used in self-critical training.
- ckpt_path refers to the directory where the fine-tuned checkpoints are saved.
- start_from refers to the directory of the cross-entropy checkpoint to initialize from.
- gpu_id refers to the GPU id.
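Self-critical fine-tuning follows the SCST recipe: sampled captions are scored with CIDEr, and the greedily decoded caption's score serves as the baseline, so only samples that beat greedy decoding are reinforced. A minimal sketch of that loss under assumed tensor shapes (again, not the repo's exact implementation):

import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    """REINFORCE with a self-critical (greedy-decoding) baseline.

    sample_logprobs: (batch, seq_len) log-probs of the sampled caption tokens
    sample_reward:   (batch,) e.g. CIDEr scores of the sampled captions
    greedy_reward:   (batch,) e.g. CIDEr scores of the greedy captions
    mask:            (batch, seq_len) 1 for real tokens, 0 for padding
    """
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (batch, 1)
    # Maximizing expected reward == minimizing -advantage-weighted log-prob.
    return -(advantage * sample_logprobs * mask).sum() / mask.sum()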
Testing
Given the trained model, the following script will report the performance on the val split of MSCOCO:
$ python test.py --ckpt_path <checkpoint_dir> --gpu_id 0
- ckpt_path refers to the directory containing the trained checkpoint.
- gpu_id refers to the GPU id.
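The reported metrics come from coco-caption-style scorers (see the Acknowledgement section). As a rough, self-contained sketch, CIDEr can also be computed directly through the pycocoevalcap API, assuming that package is installed; the image ids and captions below are made-up examples.

from pycocoevalcap.cider.cider import Cider

# Captions are assumed to be pre-tokenized, lower-cased strings.
gts = {
    "img1": ["a man riding a bike down a dirt road"],  # reference captions
    "img2": ["two dogs playing in the snow"],
}
res = {
    "img1": ["a man rides a bicycle on a path"],       # model outputs
    "img2": ["dogs play in the snow"],
}

score, per_image = Cider().compute_score(gts, res)
print(f"CIDEr: {score:.3f}")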
Pre-trained models
At present we provide the pre-trained single-view MT model (MT_sv). More models will be added in the future.
| Model | Backbone | BLEU@1 | METEOR | CIDEr | Download |
|---|---|---|---|---|---|
| MT_sv | ResNet-101 | 80.8 | 29.1 | 130.9 | model |
Citation
If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:
@article{yu2019multimodal,
title={Multimodal transformer with multi-view visual representation for image captioning},
author={Yu, Jun and Li, Jing and Yu, Zhou and Huang, Qingming},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
year={2019},
publisher={IEEE}
}
Acknowledgement
We thank Ruotian Luo for his self-critical.pytorch, cider and coco-caption repos.