EfficientViM
EfficientViM: Efficient Vision Mamba with Hidden State Mixer-based State Space Duality
Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim*
Conference on Computer Vision and Pattern Recognition (CVPR), 2025
This repository is an official implementation of the CVPR 2025 paper EfficientViM: Efficient Vision Mamba with Hidden State Mixer-based State Space Duality.
Main Results
Comparison of efficient networks on ImageNet-1K classification.
The EfficientViM family, marked with red and blue stars, achieves the best speed-accuracy trade-offs. (†: with distillation)
Image classification on ImageNet-1K (pretrained models)
| model | resolution | epochs | top-1 acc (%) | #params | FLOPs | checkpoint |
|---|---|---|---|---|---|---|
| EfficientViM-M1 | 224x224 | 300 | 72.9 | 6.7M | 239M | EfficientViM_M1_e300.pth |
| EfficientViM-M1 | 224x224 | 450 | 73.5 | 6.7M | 239M | EfficientViM_M1_e450.pth |
| EfficientViM-M2 | 224x224 | 300 | 75.4 | 13.9M | 355M | EfficientViM_M2_e300.pth |
| EfficientViM-M2 | 224x224 | 450 | 75.8 | 13.9M | 355M | EfficientViM_M2_e450.pth |
| EfficientViM-M3 | 224x224 | 300 | 77.6 | 16.6M | 656M | EfficientViM_M3_e300.pth |
| EfficientViM-M3 | 224x224 | 450 | 77.9 | 16.6M | 656M | EfficientViM_M3_e450.pth |
| EfficientViM-M4 | 256x256 | 300 | 79.4 | 19.6M | 1111M | EfficientViM_M4_e300.pth |
| EfficientViM-M4 | 256x256 | 450 | 79.6 | 19.6M | 1111M | EfficientViM_M4_e450.pth |
Image classification on ImageNet-1K with distillation
| model | resolution | epochs | top-1 acc (%) | checkpoint |
|---|---|---|---|---|
| EfficientViM-M1 | 224x224 | 300 | 74.6 | EfficientViM_M1_dist.pth |
| EfficientViM-M2 | 224x224 | 300 | 76.7 | EfficientViM_M2_dist.pth |
| EfficientViM-M3 | 224x224 | 300 | 79.1 | EfficientViM_M3_dist.pth |
| EfficientViM-M4 | 256x256 | 300 | 80.7 | EfficientViM_M4_dist.pth |
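The released checkpoints can be loaded into a model built from the code in `classification`. A minimal loading sketch is below; it assumes the checkpoint stores weights under a `"model"` key (common in Swin-style training code, which this repo builds on) — that key name and the fallback behavior are assumptions, not confirmed details of this repo:

```python
# Hedged sketch: loading a released EfficientViM checkpoint.
# Assumption: weights live under a "model" key, with a raw
# state dict as a fallback.
import torch

def load_pretrained(model, checkpoint_path):
    """Load a checkpoint into `model`; returns (missing, unexpected) keys."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
    result = model.load_state_dict(state_dict, strict=False)
    return result.missing_keys, result.unexpected_keys
```

Checking the returned missing/unexpected key lists is a quick way to verify that the checkpoint matches the model variant you instantiated.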
Getting Started
Installation
```sh
# Clone this repository:
git clone https://github.com/mlvlab/EfficientViM.git
cd EfficientViM

# Create and activate the environment
conda create -n EfficientViM python==3.10
conda activate EfficientViM

# Install dependencies
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
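After installation, a quick sanity check confirms the environment is usable (a sketch; on a GPU machine with the versions installed above, expect torch 1.12.1 and CUDA to be available):

```python
# Sanity check for the freshly created environment.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```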
Training
To train EfficientViM for classification on ImageNet, run train.sh in the classification directory:

```sh
cd classification
sh train.sh <num-gpus> <batch-size-per-gpu> <epochs> <model-name> <imagenet-path> <output-path>
```
For example, to train EfficientViM-M1 for 450 epochs on 8 GPUs (with a total batch size of 2048, i.e., <num-gpus> $\times$ <batch-size-per-gpu>), run:

```sh
sh train.sh 8 256 450 EfficientViM_M1 <imagenet-path> <output-path>
```
Training with distillation
To train EfficientViM with the distillation objective of DeiT, run train_dist.sh in the classification directory:

```sh
sh train_dist.sh <num-gpus> <batch-size-per-gpu> <model-name> <imagenet-path> <output-path>
```
Evaluation
To evaluate a pre-trained EfficientViM, run test.sh in the classification directory:

```sh
sh test.sh <num-gpus> <model-name> <imagenet-path> <checkpoint-path>

# For a model trained with distillation, use test_dist.sh instead:
# sh test_dist.sh <num-gpus> <model-name> <imagenet-path> <checkpoint-path>
```
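The accuracies reported in the tables above are standard ImageNet top-1 accuracy. For reference, a minimal sketch of the metric itself (this is the generic definition, not the repo's evaluation code):

```python
# Top-1 accuracy: fraction of samples whose highest-scoring
# class matches the ground-truth label.
import torch

def top1_accuracy(logits, targets):
    """logits: (N, C) class scores; targets: (N,) integer labels."""
    preds = logits.argmax(dim=1)
    return (preds == targets).float().mean().item()
```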
Acknowledgements
This repo is built upon Swin, VSSD, SHViT, EfficientViT, and SwiftFormer.
Thanks to the authors for their inspiring works!
Citation
If this work is helpful for your research, please consider citing it.
```bibtex
@inproceedings{lee2025efficientvim,
  title={EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality},
  author={Lee, Sanghyeok and Choi, Joonmyung and Kim, Hyunwoo J.},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```