[BUG] DeepSpeed ZeRO++ features aren't working
Describe the bug
DeepSpeed ZeRO++ features aren't working:
- On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` leads to a forward pass error with `BF16` (see the config sketch after this list for where these flags are set). Exact issue reported in https://github.com/microsoft/DeepSpeed/issues/4852.
- On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients` works with `BF16`, but I don't notice any speedup at all.
- On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` works with `FP16`, but I don't notice any speedup at all and only a 4% reduction in memory.
- On multi-node (2 nodes), passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` fails with `FP16`: the loss suddenly goes to inf and the loss scale keeps shrinking until it reaches 1, after which an error is raised.
- On multi-node (2 nodes), passing `zero_hpz_partition_size`, `zero_quantized_gradients` fails with `BF16`: the loss shoots up to 2409 at the start and then goes to inf.
- On multi-node (4 nodes), with and without Hybrid Sharding (`zero_hpz_partition_size: 8`):
  a. No speedup with Hybrid Sharding.
  b. Training loss curves are similar in both cases, unlike the issue https://github.com/microsoft/DeepSpeed/issues/4851.
  c. Eval loss is very high in spite of using the same seed; the only difference is `zero_hpz_partition_size: 8`. Refer to the screenshot below.
  d. When redoing the above experiments with the Llama-70B model, Hybrid Sharding gave the error below despite decreasing the per-GPU batch size from 4 to 2 to 1:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'
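For context, the three ZeRO++ flags were added under `zero_optimization` in the ZeRO-3 JSON config referenced in the reproduction steps below. The snippet is a minimal sketch of where those keys sit, not the exact config from the linked repo: the `"auto"` values and the stage-3 bucket/threshold keys are assumptions standing in for whatever the linked `ds_config_z3.json` and the HF Trainer integration fill in.

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": true,
    "zero_quantized_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
```

`zero_hpz_partition_size: 8` keeps a secondary copy of the 16-bit parameters sharded only within each 8-GPU node (trading extra per-GPU memory for fewer inter-node all-gathers), while the two `zero_quantized_*` flags enable quantized weight communication and quantized gradient reduction; the three keys were toggled per experiment as described above.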
To Reproduce
Steps to reproduce the behavior:
- DeepSpeed config: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/configs/ds_config_z3.json. Add the flags `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` as per the experiment being done (the sketch above shows where they go).
- Accelerate config: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/configs/deepspeed_zeropp_config.yaml
- Launch command: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/run_deepspeed_zeropp.sh.
- Launch command on the multi-node setup:
#!/bin/bash
#SBATCH --job-name=ift_llama
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1 # Crucial - only 1 task per dist per node!
##SBATCH --mem-per-cpu=11G # Uncomment to enable "mix" use of GPUs across cluster users
#SBATCH --requeue
#SBATCH --gres=gpu:8
#SBATCH --partition=cluster_name
#SBATCH --output=/path/to/temp/logs/%x-%j.out
#SBATCH --err=/path/to/temp/logs/%x-%j.err
set -x -e
# CHANGE HERE THE CONDA ENV AND ANY STARTUP SCRIPTS
source ~/.bashrc
cd /path/to/DHS-LLM-Workshop/chat_assistant/training
git pull
export NCCL_ASYNC_ERROR_HANDLING=1
export WANDB_PROJECT=deepspeed_zeropp
echo "START TIME: $(date)"
# CHANGE TO CUMULATIVELY LOG OUTPUTS
LOG_PATH="main_log.txt"
GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)
# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
# OTHER LAUNCHERS CAN BE USED HERE
export LAUNCHER="accelerate launch \
--config_file configs/deepspeed_zeropp_config.yaml \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank \$SLURM_PROCID \
--num_processes $NUM_PROCESSES \
--num_machines $NNODES \
"
# Note: it is important to escape `$SLURM_PROCID` since we want the srun on each node to evaluate this variable
export PROGRAM="\
train.py \
--seed 100 \
--model_name "meta-llama/Llama-2-70b-hf" \
--dataset_name "HuggingFaceH4/ultrachat_200k" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train_sft,test_sft" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 2e-5 \
--lr_scheduler_type "cosine" \
--weight_decay 0.0 \
--warmup_ratio 0.1 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-ds-multinode-zpp" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--dataset_text_field "content" \
--use_flash_attn True \
--gradient_checkpointing True \
--use_reentrant False
"
export CMD="$LAUNCHER $PROGRAM"
srun --jobid $SLURM_JOBID bash -c "$CMD" 2>&1 | tee -a $LOG_PATH
echo "END TIME: $(date)"
Expected behavior
- Hybrid Sharding (`zero_hpz_partition_size`) should result in a speedup on the multi-node setup (4 nodes experimented with above).
- Hybrid Sharding (`zero_hpz_partition_size`) should not result in OOM with the 70B model on the multi-node setup (4 nodes), where each node has 8x 80GB GPUs.
- Hybrid Sharding (`zero_hpz_partition_size`) should result in the same eval loss as normal ZeRO-3 fine-tuning on the multi-node setup (4 nodes experimented with above).
- `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` should work with mixed-precision training in `BF16`.
- On multi-node, `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` should work with `FP16`/`BF16` without the loss jumping to inf.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 999.99 GB
System info (please complete the following information):
- OS: Ubuntu 20.04.6 LTS
- GPU count and types: 8x H100s per node (single-node and up to 4-node runs)
- Python version: 3.10.13
Launcher context
Accelerate launcher, which internally uses the DeepSpeed launcher.