[BUG] Zero2/3 segmentation fault with CPU optimizer off-loading

Open · haixpham opened this issue on Dec 12, 2023

Describe the bug DeepSpeed segfaults when loading CPU_ADAM (the cpu_adam op), with both the ZeRO-2 and ZeRO-3 configs below, via the Hugging Face Transformers integration.

ZeRO Configurations

  • ZeRO-2
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
  • ZeRO-3
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": 0,
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
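
For context, these configs are consumed through the Hugging Face Trainer; below is a minimal sketch of that wiring (the model, dataset, and output path are placeholders, not the actual training code):

# Minimal sketch: how a ds_config like the ones above is passed through the HF Trainer integration.
# The model and dataset below are placeholders, not the actual training setup.
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # placeholder model

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    fp16=True,                         # fills "fp16": {"enabled": "auto"} in the config
    deepspeed="ds_config_zero3.json",  # path to one of the configs above
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset: placeholder
trainer.train()  # cpu_adam is JIT-built and loaded here because offload_optimizer.device == "cpu"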

Expected behavior CPU_ADAM is built and loaded successfully.

ds_report output

  • On the cluster that doesn't work, with CUDA 11.8. Note that the report below was taken with DeepSpeed 0.12.4; I got the same error with every version from 0.9.5 to 0.12.4.
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch']
torch version .................... 2.1.1
deepspeed install path ........... ['/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.12.4, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 1.00 GB
  • On the cluster where DeepSpeed works properly, with CUDA 12.1
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/SERILOCAL/hai.xuanpham/anaconda3/envs/udop/lib/python3.11/site-packages/torch']
torch version .................... 2.1.1
deepspeed install path ........... ['/home/SERILOCAL/hai.xuanpham/anaconda3/envs/udop/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.12.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 377.20 GB

Error message

[2023-12-12 10:42:54,556] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
[2023-12-12 10:42:54,577] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/work/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Creating extension directory /home/work/.cache/torch_extensions/py311_cu118/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/work/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
lang: 35555505
mae: 11230058
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/TH -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/TH -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
Using /home/work/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
[3/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/TH -isystem /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/work/.conda/miniconda3/envs/udop/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/work/.conda/miniconda3/envs/udop/lib/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 32.8829345703125 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.6282782554626465 seconds
[main1:1148 :0:1148] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:   1148) ====
 0 0x0000000000014420 __funlockfile()  ???:0
=================================
[main1:1149 :0:1149] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:   1149) ====
 0 0x0000000000014420 __funlockfile()  ???:0
=================================
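
For debugging, the op can also be exercised in isolation outside of training; a segfault in this minimal sketch (assuming a standard DeepSpeed install) would point at the compiled op itself rather than the ZeRO config:

# Sketch: build/load cpu_adam and run one optimizer step outside of training.
import torch
from deepspeed.ops.op_builder import CPUAdamBuilder
from deepspeed.ops.adam import DeepSpeedCPUAdam

CPUAdamBuilder().load()                 # triggers the same JIT build/load shown in the log above

p = torch.nn.Parameter(torch.randn(8))  # tiny CPU parameter, just to exercise the op
p.grad = torch.randn_like(p)
opt = DeepSpeedCPUAdam([p], lr=1e-3)
opt.step()                              # a crash here isolates the problem to the compiled op
print("cpu_adam step OK")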

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types: A100 with CUDA 11.8, any number of GPUs
  • Python version: 3.11.5
  • PyTorch: 2.1.1
  • Hugging Face transformers 4.34.1, accelerate 0.25.0

Launcher context I run into the same error with both the deepspeed and torchrun launchers.

Docker context NGC PyTorch image (CUDA 11.8) with the custom conda environment described above.

haixpham · Dec 12 '23

@haixpham, can you please try the following?

  1. Set pin_memory to false in your ds_config.
  2. Increase the Docker shared memory (/dev/shm) size of the failing setup from 1 GB. I notice the working machine has a much higher value (377 GB). See #4015. A quick check of the container's /dev/shm size is sketched below.
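
A minimal sketch of that check (standard library only; the docker flag in the comment is one common way to raise the limit):

# Sketch: report the size of /dev/shm inside the container.
import shutil

total = shutil.disk_usage("/dev/shm").total
print(f"/dev/shm size: {total / 2**30:.2f} GB")
# If this prints ~1 GB, restart the container with a larger value,
# e.g. `docker run --shm-size=300g ...` (or the equivalent in your cluster config).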

tjruwase · Dec 13 '23

@tjruwase Thanks for the reply!

I tried your suggestions:

  • set pin_memory = false
  • increased the container shm to 300 GB

but ran into the same error.

Is there anything specific about the environment that can affect compiling cpu_adam?

Edit: to be doubly sure that nvcc and the CUDA toolkit are not corrupted, I installed CUDA 11.8 to a local path and ran again. The same problem persists.
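
A quick way to check which toolkit the JIT extension build actually picks up (a small sketch using standard torch utilities):

# Sketch: print the CUDA versions seen by torch and by the extension builder.
import subprocess
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)   # e.g. 11.8 on the failing cluster
print("CUDA_HOME used for JIT builds:", CUDA_HOME)    # should point at the intended toolkit
subprocess.run(["nvcc", "--version"], check=False)    # toolkit found on PATH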

haixpham · Dec 14 '23

Hi @haixpham, we encountered the same issue. Did you figure out a solution?

protossw512 · Jul 17 '24

> Hi @haixpham, we encountered the same issue. Did you figure out a solution?

Do you happen to use the CUDA 11.8 build of PyTorch? If so, please switch to the CUDA 12.1 build. No matter what I tried, DeepSpeed always ran into the segfault with the CUDA 11.8 build of PyTorch.
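
A quick way to confirm which build you are on, and the usual command to switch (a sketch; adjust versions for your environment):

# Sketch: verify whether the installed torch is a CUDA 11.8 or 12.1 build.
import torch

print(torch.__version__, "built with CUDA", torch.version.cuda)
# If this reports 11.8, reinstalling the CUDA 12.1 wheels is the usual fix, e.g.:
#   pip install torch --index-url https://download.pytorch.org/whl/cu121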

haixpham · Jul 17 '24

> Hi @haixpham, we encountered the same issue. Did you figure out a solution?
>
> Do you happen to use the CUDA 11.8 build of PyTorch? If so, please switch to the CUDA 12.1 build. No matter what I tried, DeepSpeed always ran into the segfault with the CUDA 11.8 build of PyTorch.

Thank you for your timely response; we will try different CUDA versions!

protossw512 · Jul 17 '24