Very different training stability with the original data augmentation pipeline
I'm trying to pre-train Swin-B on ImageNet-1K on 4 GPUs using the following command:
python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py \
--cfg configs/swin/swin_base_patch4_window7_224.yaml --data-path /scratch_net/biwidl213_second/nargenziano/imagenet \
--batch-size 128 --accumulation-steps 2 --opts TRAIN.EPOCHS 100 TRAIN.WARMUP_EPOCHS 5 \
--output checkpoints/swin_base_STDAug --use-checkpoint
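For reference, with 4 processes, --batch-size 128 per GPU and --accumulation-steps 2, the effective batch size should be 128 × 4 × 2 = 1024 per optimizer step (assuming each process draws its own --batch-size samples).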
In particular, I'm comparing two different data augmentation pipelines: the original one and a reduced one. For the reduced pipeline, I edited lines 129-138 of data/builder.py to:
# reduced pipeline: random resized crop + horizontal flip + normalization only
# (no color jitter, no AutoAugment, no random erasing)
transform = create_transform(
    input_size=config.DATA.IMG_SIZE,
    is_training=True,
    color_jitter=0.,
    hflip=0.5,
    mean=IMAGENET_DEFAULT_MEAN,
    std=IMAGENET_DEFAULT_STD)
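For comparison, the original call that these lines replace is, as far as I remember the unmodified file, along these lines (with the default config this means color jitter 0.4, AutoAugment rand-m9-mstd0.5-inc1 and random erasing with probability 0.25, on top of the random resized crop and horizontal flip):

transform = create_transform(
    input_size=config.DATA.IMG_SIZE,
    is_training=True,
    color_jitter=config.AUG.COLOR_JITTER if config.AUG.COLOR_JITTER > 0 else None,
    auto_augment=config.AUG.AUTO_AUGMENT if config.AUG.AUTO_AUGMENT != 'none' else None,
    re_prob=config.AUG.REPROB,
    re_mode=config.AUG.REMODE,
    re_count=config.AUG.RECOUNT,
    interpolation=config.DATA.INTERPOLATION)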
When running pre-training with this custom pipeline, the test accuracy increases steadily even during the first 9 epochs:

Acc@1 5.818   Acc@5 16.394
Acc@1 14.134  Acc@5 31.050
Acc@1 20.776  Acc@5 41.090
Acc@1 25.928  Acc@5 48.310
Acc@1 28.368  Acc@5 51.492
Acc@1 32.326  Acc@5 55.572
Acc@1 36.132  Acc@5 60.238
Acc@1 39.272  Acc@5 63.244
Acc@1 42.422  Acc@5 66.680
Instead, the run using the original pipeline is much more unstable:

Acc@1 3.314   Acc@5 10.562
Acc@1 7.490   Acc@5 19.902
Acc@1 12.230  Acc@5 28.590
Acc@1 0.226   Acc@5 0.898
Acc@1 0.914   Acc@5 3.662
Acc@1 4.492   Acc@5 13.118
Acc@1 2.862   Acc@5 8.644
Acc@1 3.082   Acc@5 9.492
Acc@1 4.470   Acc@5 12.968
Below is my environment info:
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: version 3.13.4
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti
GPU 2: NVIDIA GeForce GTX 1080 Ti
GPU 3: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 470.141.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.8.0
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py37h7f8727e_0
[conda] mkl_fft 1.3.1 py37hd3c417c_0
[conda] mkl_random 1.2.2 py37h51133e4_0
[conda] numpy 1.21.5 py37h6c91a56_3
[conda] numpy-base 1.21.5 py37ha15fc14_3
[conda] pytorch 1.8.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchvision 0.9.0 py37_cu102 pytorch