
[BUG] Grad_norm is nan and Loss is 0

xxtars opened this issue on Apr 02, 2024

Describe the bug
When training LLaMA-VID (stage 2, full fine-tuning of the LLaMA backbone) with deepspeed==0.14.0 and the transformers Trainer, grad_norm becomes NaN (or 1.414 with a smaller lr, the pink line in the plot) and the loss drops to 0 after a few steps. This is the same issue as described in #5242, but on AMD GPUs. Following #5242, deepspeed==0.12.3 works normally. However, neither DeepSpeed version gives a significant training speedup when using multiple nodes.

grad_norm (circled points are NaN): [grad_norm plot]

Step  Stage2_1node_ds0.14.0
37    1.6166185140609741
38    1.5178347826004028
39    NaN
40    1.434411883354187
41    NaN
42    NaN

Step  Stage2_4node_ds0.14.0
0     23068.85156
1     24.89443588256836
2     22.727699279785156
3     23.45322036743164
4     NaN
5     NaN

Step  Stage2_1node_ds0.14.0_smaller_lr
113   1.500418203016511
114   1.0031307797956182
115   1.4142135623730951
116   1.4142135623730951
117   1.4142135623730951
118   1.4142135623730951

loss: [loss plot]

training speed:

# one node, ds 0.14.0
[default0]:  0%|          | 11/5964 [05:15<46:24:59, 28.07s/it]
[default0]:  0%|          | 12/5964 [05:40<45:14:48, 27.37s/it]
[default0]:  0%|          | 13/5964 [06:05<43:52:20, 26.54s/it]
[default0]:  0%|          | 14/5964 [06:50<53:08:24, 32.15s/it]
[default0]:  0%|          | 15/5964 [07:16<50:11:33, 30.37s/it]
# four nodes, ds 0.14.0
[default0]:  0%|          | 11/5964 [04:11<36:46:44, 22.24s/it]
[default0]:  0%|          | 12/5964 [04:32<36:07:52, 21.85s/it]
[default0]:  0%|          | 13/5964 [04:49<33:38:57, 20.36s/it]
[default0]:  0%|          | 14/5964 [05:20<38:58:16, 23.58s/it]
[default0]:  0%|          | 15/5964 [05:44<39:29:56, 23.90s/it]

(Stage 1, which trains only the connector, works normally with both DeepSpeed versions, including the expected speedup when training on multiple nodes.)

training speed:

# one node, ds 0.14.0
[default0]:  0%|          | 11/3086 [01:43<6:47:31,  7.95s/it]
[default0]:  0%|          | 12/3086 [01:51<6:58:28,  8.17s/it]
[default0]:  0%|          | 13/3086 [02:00<7:02:19,  8.25s/it]
[default0]:  0%|          | 14/3086 [02:06<6:33:13,  7.68s/it]
[default0]:  0%|          | 15/3086 [02:14<6:38:44,  7.79s/it]
# four nodes, ds 0.14.0
[default0]:  0%|          | 11/3086 [00:36<2:21:11,  2.76s/it]
[default0]:  0%|          | 12/3086 [00:39<2:27:00,  2.87s/it]
[default0]:  0%|          | 13/3086 [00:41<2:24:27,  2.82s/it]
[default0]:  0%|          | 14/3086 [00:43<2:12:09,  2.58s/it]
[default0]:  0%|          | 15/3086 [00:46<2:11:55,  2.58s/it]

I'm not sure whether the training speed is related to issue #5242, but it looks abnormal: on A100s with multiple nodes I do get a significant speed improvement.

To Reproduce
Steps to reproduce the behavior:

  1. My run script, launched with slurm via srun bash scripts/video/train/stage_2_full_v7b_224_fps_1_torchrun.sh:
#!/bin/bash -e
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=12345

export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export MIOPEN_DEBUG_DISABLE_SQL_WAL=1
export MIOPEN_USER_DB_PATH="$HOME/.cache/$(whoami)-miopen-cache-$SLURM_NODEID"  # use $HOME; a literal "~" is not expanded inside quotes
export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH

# Set MIOpen cache to a temporary folder.
if [ $SLURM_LOCALID -eq 0 ] ; then
    rm -rf $MIOPEN_USER_DB_PATH
    mkdir -p $MIOPEN_USER_DB_PATH
fi
sleep 2

export MPICH_GPU_SUPPORT_ENABLED=1

# Set interfaces to be used by RCCL.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=3

export LAUNCHER="python -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $SLURM_NNODES \
    --node_rank $SLURM_PROCID \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --max_restarts 0 \
    --tee 3 \
    "

export Stage2_CMD=" \
    llamavid/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path model_zoo/LLM/llama2/Llama-2-7b-chat-hf \
    --version imgsp_llama_2 \
    --data_path ./data/LLaMA-VID-Finetune/llava_v1_5_mix665k_with_video_chatgpt.json \
    --image_folder ./data/LLaMA-VID-Finetune \
    --video_folder ./data/LLaMA-VID-Finetune \
    --vision_tower model_zoo/LAVIS/eva_vit_g.pth \
    --image_processor ./llamavid/processor/clip-patch14-224 \
    --pretrain_mm_mlp_adapter ./work_dirs/llama2-vid-7b-pretrain-224-video-fps-1/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --video_fps 1 \
    --bert_type "qformer_pretrain" \
    --num_query 32 \
    --compress_type "mean" \
    --bf16 True \
    --output_dir ./work_dirs/llama2-vid-7b-full-224-video-fps-1-torchrun  \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --run_name LUMI_Stage2_LLaMA
    "

# 8 x 4 x 4 = 128
# 8 x 2 x 8 = 128
# 32 x 2 x 2 = 128

bash -c "$LAUNCHER $Stage2_CMD"

Expected behavior
grad_norm != NaN and loss != 0

ds_report output

# ds 0.14.0
[2024-04-02 08:21:58,307] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/xxx/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+rocm5.6
deepspeed install path ........... ['/xxx/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.6.31061-8c743ae5d
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.2, hip 5.6
shared memory (/dev/shm) size .... 427.71 GB

System info (please complete the following information):

  • OS: SUSE Linux Enterprise Server 15 SP4
  • GPU count and types: one or four nodes with 8x MI250X each
  • Python version: 3.10.14
  • transformers==4.39.2

Launcher context
slurm -> torch.distributed.run

Additional context
Following #5242 and using ds 0.12.3 on one node:
grad_norm: [W&B chart 2024_4_2 14_16_49]
loss: [W&B chart 2024_4_2 14_16_57]

xxtars (Apr 02 '24)

Setting overlap_comm to False can avoid this problem.

efsotr (May 11 '24)
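
For reference, a minimal sketch of where this switch lives in a ZeRO stage-2 config. The issue's actual ./scripts/zero2.json is not shown in the thread, so every value here other than "overlap_comm": false is an illustrative assumption:

{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7,
    "allgather_bucket_size": 5e7
  },
  "bf16": {
    "enabled": "auto"
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}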

Hi @xxtars, we noticed this accuracy issue in 0.14.0 (some of our users also fell back to 0.12.3) and made several accuracy fixes afterwards. Could you try 0.14.2? Thanks.

GuanhuaWang (May 14 '24)

Setting overlap_comm to False can avoid this problem.

This works in my multi-node training scenario.

weimakeit (Jun 14 '24)

I have checked 0.15.0 and the problem still exists. Another workaround is to increase the bucket size; for example, raising it from 5e7 to 2e8 can help.

jianshuod (Sep 05 '24)
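
A sketch of where that change could go in the ZeRO stage-2 config. Which bucket the comment means is not specified, so both common ZeRO-2 bucket knobs are shown; everything except the 2e8 figure is an illustrative assumption rather than the issue's actual zero2.json:

{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  }
}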

I have checked 0.15.0 and the problem still exists. Another workaround is to increase the bucket size; for example, raising it from 5e7 to 2e8 can help.

Hi @jianshuod, does overlap_comm=False work on your side?

GuanhuaWang (Sep 09 '24)

Since it has been two weeks, I will close this for now. Feel free to reopen if needed.

GuanhuaWang (Sep 20 '24)

overlap_comm=False works for me (deepspeed==0.15.1).

nstl-zyb (Sep 24 '24)