Very different training stability with the original data augmentation pipeline
I'm trying to pre-train Swin-B on ImageNet-1K on 4 GPUs using the following command:
python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py \
--cfg configs/swin/swin_base_patch4_window7_224.yaml --data-path /scratch_net/biwidl213_second/nargenziano/imagenet \
--batch-size 128 --accumulation-steps 2 --opts TRAIN.EPOCHS 100 TRAIN.WARMUP_EPOCHS 5 \
--output checkpoints/swin_base_STDAug --use-checkpoint
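For reference, with 4 processes, --batch-size 128 per GPU and --accumulation-steps 2, the effective batch size should be 128 × 4 × 2 = 1024 per optimizer step (assuming each process draws its own --batch-size samples).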
In particular, I'm comparing two different data augmentation pipelines: the original one and a reduced one. For the reduced pipeline, I edited lines 129-138 of data/builder.py to:
# reduced pipeline: random resized crop + horizontal flip + normalization only
# (no color jitter, no AutoAugment, no random erasing)
transform = create_transform(
    input_size=config.DATA.IMG_SIZE,
    is_training=True,
    color_jitter=0.,
    hflip=0.5,
    mean=IMAGENET_DEFAULT_MEAN,
    std=IMAGENET_DEFAULT_STD)
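For comparison, the original call that these lines replace is, as far as I remember the unmodified file, along these lines (with the default config this means color jitter 0.4, AutoAugment rand-m9-mstd0.5-inc1 and random erasing with probability 0.25, on top of the random resized crop and horizontal flip):

transform = create_transform(
    input_size=config.DATA.IMG_SIZE,
    is_training=True,
    color_jitter=config.AUG.COLOR_JITTER if config.AUG.COLOR_JITTER > 0 else None,
    auto_augment=config.AUG.AUTO_AUGMENT if config.AUG.AUTO_AUGMENT != 'none' else None,
    re_prob=config.AUG.REPROB,
    re_mode=config.AUG.REMODE,
    re_count=config.AUG.RECOUNT,
    interpolation=config.DATA.INTERPOLATION)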
When running pre-training with this custom pipeline, the test accuracy increases steadily even during the first 9 epochs:

Acc@1 5.818   Acc@5 16.394
Acc@1 14.134  Acc@5 31.050
Acc@1 20.776  Acc@5 41.090
Acc@1 25.928  Acc@5 48.310
Acc@1 28.368  Acc@5 51.492
Acc@1 32.326  Acc@5 55.572
Acc@1 36.132  Acc@5 60.238
Acc@1 39.272  Acc@5 63.244
Acc@1 42.422  Acc@5 66.680
Instead, the run using the original pipeline is much more unstable:

Acc@1 3.314   Acc@5 10.562
Acc@1 7.490   Acc@5 19.902
Acc@1 12.230  Acc@5 28.590
Acc@1 0.226   Acc@5 0.898
Acc@1 0.914   Acc@5 3.662
Acc@1 4.492   Acc@5 13.118
Acc@1 2.862   Acc@5 8.644
Acc@1 3.082   Acc@5 9.492
Acc@1 4.470   Acc@5 12.968
Below is my environment info:
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: version 3.13.4
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti
GPU 2: NVIDIA GeForce GTX 1080 Ti
GPU 3: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 470.141.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.8.0
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py37h7f8727e_0
[conda] mkl_fft 1.3.1 py37hd3c417c_0
[conda] mkl_random 1.2.2 py37h51133e4_0
[conda] numpy 1.21.5 py37h6c91a56_3
[conda] numpy-base 1.21.5 py37ha15fc14_3
[conda] pytorch 1.8.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchvision 0.9.0 py37_cu102 pytorch