[BUG] Zero2/3 segmentation fault with CPU optimizer off-loading

Open · haixpham opened this issue on Dec 12, 2023

Describe the bug DeepSpeed segfaults when loading CPU_ADAM (the cpu_adam op), with both the ZeRO-2 and ZeRO-3 configs below, via the Hugging Face Transformers integration.

ZeRO Configurations

  • ZeRO-2
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
  • ZeRO-3
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": 0,
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
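
For context, these configs are consumed through the Hugging Face Trainer; below is a minimal sketch of that wiring (the model, dataset, and output path are placeholders, not the actual training code):

# Minimal sketch: how a ds_config like the ones above is passed through the HF Trainer integration.
# The model and dataset below are placeholders, not the actual training setup.
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # placeholder model

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    fp16=True,                         # fills "fp16": {"enabled": "auto"} in the config
    deepspeed="ds_config_zero3.json",  # path to one of the configs above
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset: placeholder
trainer.train()  # cpu_adam is JIT-built and loaded here because offload_optimizer.device == "cpu"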

Expected behavior CPU_ADAM is built and loaded successfully.

ds_report output

  • On the cluster that doesn't work, with CUDA 11.8. Note that the report below was taken with DeepSpeed 0.12.4; I got the same error with every version from 0.9.5 to 0.12.4.
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch']
torch version .................... 2.1.1
deepspeed install path ........... ['/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.12.4, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 1.00 GB
  • On the cluster where DeepSpeed works properly, with CUDA 12.1
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/SERILOCAL/hai.xuanpham/anaconda3/envs/udop/lib/python3.11/site-packages/torch']
torch version .................... 2.1.1
deepspeed install path ........... ['/home/SERILOCAL/hai.xuanpham/anaconda3/envs/udop/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.12.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 377.20 GB

Error message

[2023-12-12 10:42:54,556] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
[2023-12-12 10:42:54,577] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/work/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Creating extension directory /home/work/.cache/torch_extensions/py311_cu118/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/work/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
lang: 35555505
mae: 11230058
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/TH -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/TH -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
Using /home/work/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
[3/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/TH -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 32.8829345703125 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.6282782554626465 seconds
[main1:1148 :0:1148] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:   1148) ====
 0 0x0000000000014420 __funlockfile()  ???:0
=================================
[main1:1149 :0:1149] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:   1149) ====
 0 0x0000000000014420 __funlockfile()  ???:0
=================================
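
For debugging, the op can also be exercised in isolation outside of training; a segfault in this minimal sketch (assuming a standard DeepSpeed install) would point at the compiled op itself rather than the ZeRO config:

# Sketch: build/load cpu_adam and run one optimizer step outside of training.
import torch
from deepspeed.ops.op_builder import CPUAdamBuilder
from deepspeed.ops.adam import DeepSpeedCPUAdam

CPUAdamBuilder().load()                 # triggers the same JIT build/load shown in the log above

p = torch.nn.Parameter(torch.randn(8))  # tiny CPU parameter, just to exercise the op
p.grad = torch.randn_like(p)
opt = DeepSpeedCPUAdam([p], lr=1e-3)
opt.step()                              # a crash here isolates the problem to the compiled op
print("cpu_adam step OK")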

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types: A100 with CUDA 11.8, any number of GPUs
  • Python version: 3.11.5
  • PyTorch: 2.1.1
  • Hugging Face transformers 4.34.1, accelerate 0.25.0

Launcher context I run into the same error with both the deepspeed and torchrun launchers.

Docker context NGC PyTorch image (CUDA 11.8) with the custom conda environment described above.

haixpham · Dec 12 '23

@haixpham, can you please try the following?

  1. Set pin_memory to false in your ds_config.
  2. Increase the Docker shared memory (/dev/shm) size of the failing setup from 1 GB. I notice the working machine has a much higher value (377 GB). See #4015. A quick check of the container's /dev/shm size is sketched below.
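
A minimal sketch of that check (standard library only; the docker flag in the comment is one common way to raise the limit):

# Sketch: report the size of /dev/shm inside the container.
import shutil

total = shutil.disk_usage("/dev/shm").total
print(f"/dev/shm size: {total / 2**30:.2f} GB")
# If this prints ~1 GB, restart the container with a larger value,
# e.g. `docker run --shm-size=300g ...` (or the equivalent in your cluster config).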

tjruwase · Dec 13 '23

@tjruwase Thanks for the reply!

I tried your suggestions:

  • set pin_memory = false
  • increased the container shm to 300 GB

but ran into the same error.

Is there anything specific about the environment that can affect compiling cpu_adam?

Edit: to be doubly sure that nvcc and the CUDA toolkit are not corrupted, I installed CUDA 11.8 to a local path and ran again. The same problem persists.
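
A quick way to check which toolkit the JIT extension build actually picks up (a small sketch using standard torch utilities):

# Sketch: print the CUDA versions seen by torch and by the extension builder.
import subprocess
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)   # e.g. 11.8 on the failing cluster
print("CUDA_HOME used for JIT builds:", CUDA_HOME)    # should point at the intended toolkit
subprocess.run(["nvcc", "--version"], check=False)    # toolkit found on PATH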

haixpham · Dec 14 '23

Hi @haixpham, we encountered the same issue. Did you figure out a solution?

protossw512 · Jul 17 '24

> Hi @haixpham, we encountered the same issue. Did you figure out a solution?

Do you happen to use the CUDA 11.8 build of PyTorch? If so, please switch to the CUDA 12.1 build. No matter what I tried, DeepSpeed always ran into the segfault with the CUDA 11.8 build of PyTorch.
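
A quick way to confirm which build you are on, and the usual command to switch (a sketch; adjust versions for your environment):

# Sketch: verify whether the installed torch is a CUDA 11.8 or 12.1 build.
import torch

print(torch.__version__, "built with CUDA", torch.version.cuda)
# If this reports 11.8, reinstalling the CUDA 12.1 wheels is the usual fix, e.g.:
#   pip install torch --index-url https://download.pytorch.org/whl/cu121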

haixpham · Jul 17 '24

> Hi @haixpham, we encountered the same issue. Did you figure out a solution?
>
> Do you happen to use the CUDA 11.8 build of PyTorch? If so, please switch to the CUDA 12.1 build. No matter what I tried, DeepSpeed always ran into the segfault with the CUDA 11.8 build of PyTorch.

Thank you for your timely response; we will try different CUDA versions!

protossw512 · Jul 17 '24