Distributed training hangs indefinitely at FP16_Optimizer#step
When I run distributed training with more than one GPU, training gets stuck at the very beginning and hangs indefinitely. It is stuck in FP16_Optimizer#step (specifically at this line, where data is implicitly moved from the GPU to the CPU).
The command line hangs here indefinitely and makes no progress no matter how long I wait:
training: 0%| | 0/1000000 [00:00<?, ?it/s]
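For context on why a hang can surface at a GPU-to-CPU transfer, here is a minimal sketch of the kind of FP16 overflow check an optimizer step performs. This is not the apex source, just an illustration: the .item() call blocks until all previously queued CUDA work (including any pending NCCL collective) finishes, so if another rank is stuck, the wait never ends and the process appears frozen at this line.

```python
import torch

def has_overflow(params):
    # Illustrative sketch only, not the actual FP16_Optimizer code: sum the
    # gradient magnitudes on the GPU, then pull the result to the CPU.
    total = torch.zeros(1, device="cuda")
    for p in params:
        if p.grad is not None:
            total += p.grad.float().abs().sum()
    # .item() is an implicit GPU->CPU copy: it blocks until every previously
    # queued CUDA kernel and NCCL collective has completed. If another rank
    # never issued its matching collective, this line waits forever.
    return not torch.isfinite(total).item()
```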
I see this issue regardless of which CUDA version I use (I've tried it with 10.0 and 10.1), and regardless of whether I install apex myself or use the docker image icaruszyz/large-scale-training:dialogpt.
I do not experience this issue when I run demo.py rather than using python -m torch.distributed.launch to run training (i.e. I see this issue only when I try to train on multiple GPUs, not on a single GPU). I have not tried training with full 32-bit precision because I want to limit the number of GPUs I have to use.
The fact that this issue only occurs when training with multiple GPUs, and that it occurs on a line which transfers data from the GPU to the CPU, suggests to me that there may be a race condition related to collecting data from multiple GPUs.
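To make that deadlock hypothesis concrete, here is a hypothetical minimal reproduction (repro.py is not part of DialoGPT): if one rank reaches a collective that the other ranks never call, every process blocks silently with no error, which is exactly the symptom described above.

```python
import argparse
import torch
import torch.distributed as dist

def main():
    # torch.distributed.launch passes --local_rank to each process.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    x = torch.ones(1, device="cuda")
    if dist.get_rank() == 0:
        # Rank 0 waits here for an all_reduce the other ranks never issue,
        # so the job hangs indefinitely without raising an exception.
        dist.all_reduce(x)

if __name__ == "__main__":
    main()
```

Launched with python -m torch.distributed.launch --nproc_per_node=2 repro.py, this hangs forever with no traceback, so a mismatched collective (or one rank failing before reaching it) would produce exactly this behavior.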
Training configuration:
INFO - __main__ - Input Argument Information
INFO - __main__ - model_name_or_path ./configs/762M
INFO - __main__ - seed 42
INFO - __main__ - max_seq_length 128
INFO - __main__ - skip_eval False
INFO - __main__ - init_checkpoint ./models/large/large_fs.pkl
INFO - __main__ - train_input_file ./data/train.128len.db
INFO - __main__ - eval_input_file ./data/dummy_data.tsv
INFO - __main__ - continue_from 0
INFO - __main__ - train_batch_size 8
INFO - __main__ - gradient_accumulation_steps 2
INFO - __main__ - eval_batch_size 16
INFO - __main__ - learning_rate 0.0001
INFO - __main__ - num_optim_steps 1000000
INFO - __main__ - valid_step 10000
INFO - __main__ - warmup_proportion 0.1
INFO - __main__ - warmup_steps 16000
INFO - __main__ - normalize_data True
INFO - __main__ - fp16 True
INFO - __main__ - lr_schedule noam
INFO - __main__ - loss_scale 0
INFO - __main__ - no_token_id True
INFO - __main__ - output_dir models/output_model
INFO - __main__ - log_dir None
INFO - __main__ - pbar True
INFO - __main__ - local_rank 0
INFO - __main__ - config None
INFO - __main__ - device cuda:0
INFO - __main__ - n_gpu 1
This bug is preventing me from fine-tuning the large model, which requires multiple GPUs.
Has anyone else experienced this or found a workaround?
Our model was trained on multiple GPUs without issues; it might be a display issue related to the progress bar, or a different version of apex...
So can you try with a very small num_optim_steps (say 100) and see whether it actually trains the model or not?
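One way to check this independently of the tqdm bar (a hypothetical helper, not code from the repo) is to have rank 0 log the loss every few optimizer steps; if the logged loss keeps changing while the bar sits at 0%, it is only a display problem.

```python
import logging
import torch.distributed as dist

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_progress(step, loss, log_every=10):
    # Hypothetical helper: only rank 0 logs, every `log_every` optimizer steps,
    # so training progress stays visible even if the progress bar never updates.
    is_rank0 = (not dist.is_initialized()) or dist.get_rank() == 0
    if is_rank0 and step % log_every == 0:
        logger.info("step %d  loss %.4f", step, float(loss))
```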