
deepspeed OVERFLOW!

Open Xiaoni-61 opened this issue 1 year ago • 2 comments

[2024-11-04 11:41:27,602] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, but hysteresis is 2. Reducing hysteresis to 1
[2024-11-04 11:41:27,623] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2024-11-04 11:43:36,061] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, but hysteresis is 2. Reducing hysteresis to 1
[2024-11-04 11:43:39,575] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2024-11-04 11:45:48,644] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, but hysteresis is 2. Reducing hysteresis to 1
[2024-11-04 11:45:51,164] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2024-11-04 11:48:02,638] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, but hysteresis is 2. Reducing hysteresis to 1
[2024-11-04 11:48:02,902] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096

Why does this happen, and how can I fix it? My machine has 8 RTX 3090 GPUs.

Xiaoni-61 avatar Nov 04 '24 11:11 Xiaoni-61

@Xiaoni-61 Is the effective total batch size larger than the length of the dataset?

bobo0810 avatar Nov 05 '24 06:11 bobo0810

This is fp16 overflow. Switching to bf16 fixes it.
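For anyone landing here, a rough sketch of what that switch could look like in a DeepSpeed config, written as a Python dict. The starting config and the `switch_to_bf16` helper are hypothetical and only for illustration; the `fp16` and `bf16` blocks with their `enabled` flags are the standard DeepSpeed config keys. bf16 has the same exponent range as fp32, so it does not rely on loss scaling, and the RTX 3090 (Ampere) supports it.

```python
import json

# Hypothetical starting config; only the precision blocks matter here.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value, not from the issue
    "fp16": {"enabled": True},
}


def switch_to_bf16(config):
    """Return a copy of the config with fp16 disabled and bf16 enabled."""
    updated = dict(config)  # shallow copy; the precision blocks are replaced, not mutated
    updated["fp16"] = {"enabled": False}
    updated["bf16"] = {"enabled": True}
    return updated


bf16_config = switch_to_bf16(ds_config)

# Print the result so it can be saved as a JSON config file if preferred.
print(json.dumps(bf16_config, indent=2))
```

The resulting dict can be passed through the `config` argument of `deepspeed.initialize`, or saved as JSON and referenced from the launcher. If a training framework sits on top of DeepSpeed, its own precision flag probably needs to match as well.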

ldlbest avatar Dec 01 '24 06:12 ldlbest