[BUG] DeepSpeed ZeRO++ features aren't working
Describe the bug
DeepSpeed ZeRO++ features aren't working:
- On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` leads to a forward pass error with `BF16` (see the config sketch after this list for where these flags are set). Exact issue reported in https://github.com/microsoft/DeepSpeed/issues/4852.
- On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients` works with `BF16`, but I don't notice any speedup at all.
- On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` works with `FP16`, but I don't notice any speedup at all and only a 4% reduction in memory.
- On multi-node (2 nodes), passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` fails with `FP16`: the loss suddenly goes to inf and the loss scale keeps shrinking until it reaches 1, after which an error is raised.
- On multi-node (2 nodes), passing `zero_hpz_partition_size`, `zero_quantized_gradients` fails with `BF16`: the loss shoots up to 2409 at the start and then goes to inf.
- On multi-node (4 nodes), with and without Hybrid Sharding (`zero_hpz_partition_size: 8`):
  a. No speedup with Hybrid Sharding.
  b. Training loss curves are similar in both cases, unlike the issue https://github.com/microsoft/DeepSpeed/issues/4851.
  c. Eval loss is very high in spite of using the same seed; the only difference is `zero_hpz_partition_size: 8`. Refer to the screenshot below.
  d. When redoing the above experiments with the Llama-70B model, Hybrid Sharding gave the error below despite decreasing the per-GPU batch size from 4 to 2 to 1:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'
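For context, the three ZeRO++ flags were added under `zero_optimization` in the ZeRO-3 JSON config referenced in the reproduction steps below. The snippet is a minimal sketch of where those keys sit, not the exact config from the linked repo: the `"auto"` values and the stage-3 bucket/threshold keys are assumptions standing in for whatever the linked `ds_config_z3.json` and the HF Trainer integration fill in.

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": true,
    "zero_quantized_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
```

`zero_hpz_partition_size: 8` keeps a secondary copy of the 16-bit parameters sharded only within each 8-GPU node (trading extra per-GPU memory for fewer inter-node all-gathers), while the two `zero_quantized_*` flags enable quantized weight communication and quantized gradient reduction; the three keys were toggled per experiment as described above.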
To Reproduce
Steps to reproduce the behavior:
- DeepSpeed config: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/configs/ds_config_z3.json. Add the flags `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` as per the experiment being done (the sketch above shows where they go).
- Accelerate config: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/configs/deepspeed_zeropp_config.yaml
- Launch command: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/run_deepspeed_zeropp.sh.
- Launch command on the multi-node setup:
#!/bin/bash
#SBATCH --job-name=ift_llama
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1 # Crucial - only 1 task per dist per node!
##SBATCH --mem-per-cpu=11G # Uncomment to enable "mix" use of GPUs across cluster users
#SBATCH --requeue
#SBATCH --gres=gpu:8
#SBATCH --partition=cluster_name
#SBATCH --output=/path/to/temp/logs/%x-%j.out
#SBATCH --err=/path/to/temp/logs/%x-%j.err
set -x -e
# CHANGE HERE THE CONDA ENV AND ANY STARTUP SCRIPTS
source ~/.bashrc
cd /path/to/DHS-LLM-Workshop/chat_assistant/training
git pull
export NCCL_ASYNC_ERROR_HANDLING=1
export WANDB_PROJECT=deepspeed_zeropp
echo "START TIME: $(date)"
# CHANGE TO CUMULATIVELY LOG OUTPUTS
LOG_PATH="main_log.txt"
GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)
# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
# OTHER LAUNCHERS CAN BE USED HERE
export LAUNCHER="accelerate launch \
--config_file configs/deepspeed_zeropp_config.yaml \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank \$SLURM_PROCID \
--num_processes $NUM_PROCESSES \
--num_machines $NNODES \
"
# Note: it is important to escape `$SLURM_PROCID` since we want the srun on each node to evaluate this variable
export PROGRAM="\
train.py \
--seed 100 \
--model_name "meta-llama/Llama-2-70b-hf" \
--dataset_name "HuggingFaceH4/ultrachat_200k" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train_sft,test_sft" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 2e-5 \
--lr_scheduler_type "cosine" \
--weight_decay 0.0 \
--warmup_ratio 0.1 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-ds-multinode-zpp" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--dataset_text_field "content" \
--use_flash_attn True \
--gradient_checkpointing True \
--use_reentrant False
"
export CMD="$LAUNCHER $PROGRAM"
srun --jobid $SLURM_JOBID bash -c "$CMD" 2>&1 | tee -a $LOG_PATH
echo "END TIME: $(date)"
Expected behavior
- Hybrid Sharding (`zero_hpz_partition_size`) should result in a speedup on the multi-node setup (4 nodes experimented with above).
- Hybrid Sharding (`zero_hpz_partition_size`) should not result in OOM with the 70B model on the multi-node setup (4 nodes), where each node has 8x 80GB GPUs.
- Hybrid Sharding (`zero_hpz_partition_size`) should result in the same eval loss as normal ZeRO-3 fine-tuning on the multi-node setup (4 nodes experimented with above).
- `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` should work with mixed-precision training in `BF16`.
- On multi-node, `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` should work with `FP16`/`BF16` without the loss jumping to inf.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 999.99 GB
System info (please complete the following information):
- OS: Ubuntu 20.04.6 LTS
- GPU count and types: 8x H100s per node (single-node and up to 4-node runs)
- Python version: 3.10.13
Launcher context
Accelerate launcher, which internally uses the DeepSpeed launcher.