
[BUG] DeepSpeed triggers a PyTorch internal error: torch._C._EngineBase returned NULL without setting an error


Describe the bug When using DeepSpeed, the model.backward(loss) step fails with "SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f1576e03330> returned NULL without setting an error".

This setup was working before, but today it consistently fails. I have tried several combinations of Python, PyTorch, and DeepSpeed versions, and all of them fail with the same error.

I read an issue here or in PyTorch's GitHub (I forget which) that mentioned NVIDIA's apex as a possible cause. However, I did not install apex, and I specifically ran pip uninstall apex, so it should not be an apex problem.

To reproduce

model = Model(.......)
model.zero_grad()
model.micro_steps = 0
# train batch for loop
for input, labels in inputs:
    output = model(input)
    loss = criterion(output, labels)
    model.backward(loss)
    model.step()
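
For context, the engine in the snippet above comes from deepspeed.initialize, roughly as sketched below. This is a simplified stand-in, not my exact script; nn.Linear is a placeholder for the real model, and ds_config is the dict shown further down.

import torch.nn as nn
import deepspeed

# Placeholder module standing in for the real model.
net = nn.Linear(16, 2)

# deepspeed.initialize wraps the module in an engine that provides
# model.backward(loss) and model.step(); the optimizer is built from ds_config.
model, optimizer, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=net.parameters(),
    config=ds_config,             # config dict shown under "DeepSpeed Config File"
    dist_init_required=False,     # the spawn launcher sets up torch.distributed itself
)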

ds_report output

[2023-06-16 18:47:30,024] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.4, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8

Error stack

File "train.py", line 880, in train_few_shot
    train_epoch(                                                                                                                                                                                          
File "train.py", line 335, in train_epoch
    model.backward(loss)               
File "${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)    
File "${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1862, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)                                  
File "${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
File "${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1994, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
File "${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
File "${HOME}/anaconda3/envs/ds12/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f1576e03330> returned NULL without setting an error

DeepSpeed Config File

ds_config = {
            "train_batch_size": 64, # my gradient accumulation step = 8, I have 8 GPUs and every gpu has batch size of 1
            "train_micro_batch_size_per_gpu": 1,
            "steps_per_print": 1000,
            "optimizer": {
                "type": "Adam",
                # "torch_adam": True,
                "adam_w_mode": True
                "params": {
                    "lr": 5e-5
                    "weight_decay": 1e-3
                    "bias_correction": True,
                    "betas": [
                        0.9,
                        0.999
                    ],
                    "eps": 1e-8
                }
            },
            "fp16": {
                "enabled": True,
                "loss_scale": 0,
                "initial_scale_power": 7,
                "loss_scale_window": 128,
                "auto_cast": False,
            },
            "zero_optimization": {
                "stage": 3,
                "offload_optimizer": {"device": "cpu"},
                "offload_param": {"device": "cpu"},
                "overlap_comm": True,
                "contiguous_gradients": True,
                "sub_group_size": 1e9,
                "reduce_bucket_size": "auto",
                "stage3_prefetch_bucket_size": "auto",
                "stage3_param_persistence_threshold": "auto",
                "stage3_max_live_parameters": 1e9,
                "stage3_max_reuse_distance": 1e9,
                "stage3_gather_16bit_weights_on_model_save": True
                },
        }
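
In case it matters, this is just a plain Python dict passed to deepspeed.initialize via its config argument; it also serializes cleanly to a ds_config.json (a quick sanity check, not part of my training script):

import json

# Dump the dict above to a JSON file; deepspeed.initialize also accepts
# a path to such a file via its config argument.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)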

System info (please complete the following information):

  • Linux OS: Ubuntu 20.04.6 LTS
  • GPUs: 8x A5000 (24GB). I have tried many other combinations, but these are the main versions of the major modules I am using:
  • PyTorch 2.0.0+cu118
  • DeepSpeed 0.9.4 (installed via pip; I also tried compiling from source, which does not work either)
  • Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
  • [GCC 11.3.0] on linux

Launcher context I use torch.multiprocessing.spawn, so I pass dist_init_required=False when I call ds_init (deepspeed.initialize).
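
Roughly how the workers are launched (a simplified sketch, not my exact code; the worker function name, port, and world size are placeholders):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # The process group is set up here by hand, which is why
    # dist_init_required=False is passed to deepspeed.initialize later on.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model, call deepspeed.initialize(..., dist_init_required=False),
    # and run the training loop shown above ...

if __name__ == "__main__":
    world_size = 8  # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)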

Thanks! Shane
