
RuntimeError: Error building extension 'fused_adam'

Open Rainbowman0 opened this issue 2 years ago • 0 comments

Hi, I am trying to run the 'cifar10_deepspeed.py' example from DeepSpeedExamples on a single node (2x RTX 3090). When I run the command below:

deepspeed cifar10_deepspeed.py --deepspeed_config ds_config.json

The build of the 'fused_adam' extension fails with the following output:

[2023-12-31 16:33:59,636] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-31 16:33:59,892] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-31 16:33:59,903] [INFO] [runner.py:571:main] cmd = /root/anaconda3/envs/DS2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed_config ds_config.json
[2023-12-31 16:34:02,182] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-31 16:34:02,409] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-12-31 16:34:02,409] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-12-31 16:34:02,409] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-12-31 16:34:02,409] [INFO] [launch.py:163:main] dist_world_size=2
[2023-12-31 16:34:02,409] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-12-31 16:34:05,126] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-31 16:34:05,161] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-31 16:34:05,411] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-31 16:34:05,411] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-12-31 16:34:05,438] [INFO] [comm.py:637:init_distributed] cdb=None
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
truck   dog   car   car
[2023-12-31 16:34:12,437] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
  dog  deer   cat   car
[2023-12-31 16:34:12,919] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2023-12-31 16:34:12,921] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-12-31 16:34:13,053] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /root/anaconda3/envs/DS2/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/DS2/include -isystem /root/anaconda3/envs/DS2/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/root/anaconda3/envs/DS2/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/DS2/include -isystem /root/anaconda3/envs/DS2/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
sh: /root/anaconda3/envs/DS2/bin/../lib/libtinfo.so.6: no version information available (required by sh)
cc1plus: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/DS2/include -isystem /root/anaconda3/envs/DS2/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/DS2/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/SYH/GithubCode/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 316, in <module>
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1208, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1285, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 458, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 502, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Traceback (most recent call last):
  File "/root/SYH/GithubCode/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 316, in <module>
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1208, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1285, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 458, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 502, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1736, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2136, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 565, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1173, in create_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py39_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[2023-12-31 16:34:44,479] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1589354
[2023-12-31 16:34:44,480] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1589355
[2023-12-31 16:34:44,517] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/DS2/bin/python', '-u', 'cifar10_deepspeed.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json'] exits with return code = 1
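
The first failing step in the log above is the nvcc compile of multi_tensor_adam.cu, which stops at "cuda_runtime.h: No such file or directory" while nvcc itself resolves to /root/anaconda3/envs/DS2/bin/nvcc. One plausible cause, not confirmed by anything in this report, is that the environment provides the nvcc binary but not the CUDA development headers in a location nvcc searches. A minimal Python sketch for checking this is shown below. The helper names are hypothetical, and the targets/<arch>/include layout is an assumption about how some conda CUDA packages arrange their headers.

```python
import os
import shutil
from pathlib import Path


def candidate_include_dirs(nvcc_path, cuda_home=None):
    """List directories where cuda_runtime.h would plausibly live.

    nvcc_path: path to the nvcc binary, or None if it is not on PATH.
    cuda_home: value of CUDA_HOME, or None if unset.
    """
    dirs = []
    if cuda_home:
        dirs.append(Path(cuda_home) / "include")
    if nvcc_path:
        # nvcc normally sits in <prefix>/bin, so look in <prefix>/include.
        prefix = Path(nvcc_path).resolve().parent.parent
        dirs.append(prefix / "include")
        # Assumption: some conda CUDA packages use targets/<arch>/include.
        dirs.extend(sorted(prefix.glob("targets/*/include")))
    # Drop duplicates while keeping the original order.
    unique = []
    for d in dirs:
        if d not in unique:
            unique.append(d)
    return unique


def find_header(dirs, name="cuda_runtime.h"):
    """Return the subset of dirs that actually contains the header."""
    return [d for d in dirs if (d / name).is_file()]


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    dirs = candidate_include_dirs(nvcc, os.environ.get("CUDA_HOME"))
    hits = find_header(dirs)
    print("nvcc:", nvcc)
    for d in dirs:
        print("checked:", d)
    print("cuda_runtime.h found in:", [str(h) for h in hits] or "nowhere checked")
```

If no checked directory contains the header, the environment most likely lacks the CUDA development files that the JIT build needs. If a directory does contain it, the next thing to look at is whether CUDA_HOME points at that same prefix.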

Here is my environment info, from ds_report:

[2023-12-31 16:41:01,186] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 125.87 GB
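
The note at the top of the report defines the two columns: "installed" means the op was prebuilt, and "compatible" means the system is judged able to JIT-build it. As a reading aid, the sketch below parses a few rows copied from the table above and lists the ops marked incompatible. It assumes plain-text output without color codes, and parse_op_table is a hypothetical helper, not part of DeepSpeed.

```python
import re

# Matches rows such as "fused_adam ............. [NO] ....... [OKAY]".
OP_LINE = re.compile(r"^(\w+)\s*\.+\s*\[(\w+)\]\s*\.+\s*\[(\w+)\]$")


def parse_op_table(text):
    """Map op name -> (installed, compatible) for each table row in text."""
    ops = {}
    for line in text.splitlines():
        match = OP_LINE.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


# Rows copied from the report above, including one warning line to skip.
SAMPLE = """\
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
sparse_attn ............ [NO] ....... [NO]
"""

if __name__ == "__main__":
    ops = parse_op_table(SAMPLE)
    incompatible = [name for name, (_, ok) in ops.items() if not ok]
    print("not JIT-compatible:", incompatible)
    print("fused_adam (installed, compatible):", ops["fused_adam"])
```

In this report fused_adam is marked compatible, yet its JIT build still failed on the missing cuda_runtime.h, so the compatibility column did not catch that problem in this run.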

pip list

Package                  Version
------------------------ ----------
annotated-types          0.6.0
certifi                  2023.11.17
charset-normalizer       3.3.2
contourpy                1.2.0
cycler                   0.12.1
deepspeed                0.12.6
filelock                 3.13.1
fonttools                4.47.0
fsspec                   2023.12.2
hjson                    3.1.0
idna                     3.6
importlib-resources      6.1.1
Jinja2                   3.1.2
kiwisolver               1.4.5
loguru                   0.7.2
MarkupSafe               2.1.3
matplotlib               3.8.2
mpi4py                   3.1.4
mpmath                   1.3.0
networkx                 3.2.1
ninja                    1.11.1.1
numpy                    1.26.2
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.18.1
nvidia-nvjitlink-cu12    12.3.101
nvidia-nvtx-cu12         12.1.105
packaging                23.2
Pillow                   10.1.0
pip                      23.3.1
psutil                   5.9.7
py-cpuinfo               9.0.0
pydantic                 2.5.3
pydantic_core            2.14.6
pynvml                   11.5.0
pyparsing                3.1.1
python-dateutil          2.8.2
requests                 2.31.0
setuptools               68.2.2
six                      1.16.0
sympy                    1.12
torch                    2.1.2
torchvision              0.16.2
tqdm                     4.66.1
triton                   2.1.0
typing_extensions        4.9.0
urllib3                  2.1.0
wheel                    0.41.2
zipp                     3.17.0

gcc --version

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I am new to DeepSpeed. What should I do? Thank you so much!

Rainbowman0, Dec 31 '23 08:12