ConMIM
Official code for ConMIM (ICLR 2023)
Masked Image Modeling with Denoising Contrast
Official PyTorch implementation and pre-trained models for "Masked Image Modeling with Denoising Contrast", published at the International Conference on Learning Representations (ICLR) 2023.

Model Zoo
- We provide the models fine-tuned on ImageNet-1K.
| Arch | Epochs | Resolution | Acc@1 | Fine-tuned model |
|---|---|---|---|---|
| ViT-S/16 | 300 | 224x224 | 82.0 | model |
| ViT-B/16 | 800 | 224x224 | 83.7 | model |
| ViT-L/16 | 800 | 224x224 | 85.2 | model |
| ViT-L/16 | 1600 | 224x224 | 85.5 | model |
Results on ImageNet-1K

Visualization
We visualize the self-attention maps between the [CLS] token and local tokens of the pre-trained ViT-B/16 model on ImageNet-1K, where (a) shows ConMIM pre-training and (b) shows vanilla instance-level contrastive pre-training. The self-attention maps of the 12 attention heads are averaged. ConMIM-pretrained models are clearly more locally discriminative and more aware of the visual context.

Setup
Clone the GitHub repo and install the required packages:
git clone https://github.com/TencentARC/ConMIM.git
cd ConMIM
pip install -r requirements.txt
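A quick sanity check that the environment is ready (a minimal sketch; it assumes requirements.txt installs a CUDA-enabled PyTorch build):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # should print a version and True on a GPU machine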
For mixed-precision training, please install apex:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
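If building the CUDA/C++ extensions fails (e.g., a CUDA toolkit mismatch), apex's own installation notes offer a Python-only build as a fallback; this is a suggestion beyond the original instructions:
# Python-only apex build (no fused CUDA/C++ extensions); run inside the apex directory
pip install -v --disable-pip-version-check --no-cache-dir ./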
Data Preparation
- We use the standard ImageNet-1K dataset (http://image-net.org/) for pre-training.
- Read from the train and val lists (download via this link) to speed up reading images from a massive number of small files:
/dataset
└── imagenet1k
├── train
├── val
├── train_map.txt
└── val_map.txt
train_map.txt, val_map.txt: store the relative path (within the corresponding zip file) and the ground-truth label of each image; they can be downloaded via this link.
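The map-file layout is not documented here; a plausible format (an assumption, please verify against the downloaded files) is one relative image path and its integer class label per line:
head -n 3 ./dataset/imagenet1k/train_map.txt
# train/n01440764/n01440764_10026.JPEG 0   <- assumed "path label" format
# train/n01440764/n01440764_10027.JPEG 0
# train/n01440764/n01440764_10029.JPEG 0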
Pre-training on ImageNet-1K
- We pre-train the ViT-L/16 model with 32 NVIDIA A100 GPUs on ImageNet-1K as follows:
OUTPUT_DIR="./output/conmim_pretrained"
DATA_PATH="./dataset/imagenet1k"
mkdir -p $OUTPUT_DIR
python -m torch.distributed.launch $@ run_conmim_pretraining.py \
--data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --mask_ratio 0.75 \
--model conmim_large_patch16_224 \
--batch_size 64 --lr 7.5e-4 --warmup_epochs 10 --epochs 1600 \
--clip_grad 1.0 --drop_path 0 --layer_scale_init_value 1e-5 \
--mask_type 'random_mps32' \
--imagenet_default_mean_and_std \
--save_ckpt_freq 20
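The $@ in the launch command forwards the torch.distributed.launch arguments. As a hedged usage example (the script name, node count, and addresses below are placeholders), saving the command above as pretrain.sh and launching 4 nodes x 8 GPUs (32 GPUs in total) could look like:
# Run on each node with its own --node_rank (0..3); the flags are standard torch.distributed.launch options
bash pretrain.sh --nproc_per_node=8 --nnodes=4 --node_rank=0 \
    --master_addr=<MASTER_NODE_IP> --master_port=29500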
Fine-tuning on ImageNet-1K Classification
- We fine-tune the pre-trained ViT-Base model with 8 NVIDIA A100/V100 GPUs as follows:
CKP="./output/conmim_pretrained/checkpoint_copy-799.pth"
OUTPUT_DIR="./output/conmim_finetuned/"
DATA_PATH="/dataset/imagenet1k/"
mkdir -p $OUTPUT_DIR
python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
--model beit_base_patch16_224 --data_path ${DATA_PATH} \
--finetune ${CKP} \
--output_dir ${OUTPUT_DIR} --batch_size 128 --lr 4e-3 --update_freq 1 \
--warmup_epochs 20 --epochs 100 --layer_decay 0.65 --drop_path 0.1 \
--weight_decay 0.05 --mixup 0.8 --cutmix 1.0 --nb_classes 1000 --enable_deepspeed \
--imagenet_default_mean_and_std
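To evaluate a released fine-tuned checkpoint without further training, the script presumably follows BEiT's run_class_finetuning.py interface with an --eval flag; this is an assumption to verify against the script's argument parser:
# Hedged sketch: evaluation-only run on the ImageNet-1K validation set
# (--eval and --resume follow BEiT's interface and are assumptions here)
python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model beit_base_patch16_224 --data_path ${DATA_PATH} \
    --resume <PATH_TO_FINETUNED_CHECKPOINT> \
    --eval --batch_size 128 --nb_classes 1000 \
    --imagenet_default_mean_and_std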
Fine-tuning on ADE20K Semantic Segmentation
We follow BEiT to conduct the ADE20K semantic segmentation experiments.
Fine-tuning on COCO Detection and Segmentation
We follow MIMDet to conduct the COCO detection and segmentation experiments.
Acknowledgement
This repository is built using the BEiT repository, the mc-BEiT repository, the timm library, the DeiT repository, and the MIMDet repository.
Citation
If you find our work useful for your research, please cite our paper:
@article{yi2022masked,
  title={Masked image modeling with denoising contrast},
  author={Yi, Kun and Ge, Yixiao and Li, Xiaotong and Yang, Shusheng and Li, Dian and Wu, Jianping and Shan, Ying and Qie, Xiaohu},
  journal={International Conference on Learning Representations},
  year={2023}
}
Contact
If you have any questions, you can contact me via email: [email protected] or [email protected]