RuntimeError: Error building extension 'fused_adam'
Hi, I am trying to run 'cifar10_deepspeed.py' from DeepSpeedExamples on a single node (2x 3090). When I run the command below:
deepspeed cifar10_deepspeed.py --deepspeed_config ds_config.json
I get the following error:
[2023-12-31 16:33:59,636] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-31 16:33:59,892] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-31 16:33:59,903] [INFO] [runner.py:571:main] cmd = /root/anaconda3/envs/DS2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed_config ds_config.json
[2023-12-31 16:34:02,182] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-31 16:34:02,409] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-12-31 16:34:02,409] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-12-31 16:34:02,409] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-12-31 16:34:02,409] [INFO] [launch.py:163:main] dist_world_size=2
[2023-12-31 16:34:02,409] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-12-31 16:34:05,126] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-31 16:34:05,161] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-31 16:34:05,411] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-31 16:34:05,411] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-12-31 16:34:05,438] [INFO] [comm.py:637:init_distributed] cdb=None
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
truck dog car car
[2023-12-31 16:34:12,437] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
dog deer cat car
[2023-12-31 16:34:12,919] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2023-12-31 16:34:12,921] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-12-31 16:34:13,053] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /root/anaconda3/envs/DS2/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/DS2/include -isystem /root/anaconda3/envs/DS2/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/root/anaconda3/envs/DS2/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/DS2/include -isystem /root/anaconda3/envs/DS2/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
sh: /root/anaconda3/envs/DS2/bin/../lib/libtinfo.so.6: no version information available (required by sh)
cc1plus: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/DS2/include -isystem /root/anaconda3/envs/DS2/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
subprocess.run(
File "/root/anaconda3/envs/DS2/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/SYH/GithubCode/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 316, in <module>
model_engine, optimizer, trainloader, __ = deepspeed.initialize(
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1208, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1285, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 458, in load
return self.jit_load(verbose)
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 502, in jit_load
op_module = load(name=self.name,
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
return _jit_compile(
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
_write_ninja_file_and_build_library(
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Traceback (most recent call last):
File "/root/SYH/GithubCode/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 316, in <module>
model_engine, optimizer, trainloader, __ = deepspeed.initialize(
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1208, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1285, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 458, in load
return self.jit_load(verbose)
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 502, in jit_load
op_module = load(name=self.name,
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
return _jit_compile(
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1736, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2136, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 565, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1173, in create_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py39_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[2023-12-31 16:34:44,479] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1589354
[2023-12-31 16:34:44,480] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1589355
[2023-12-31 16:34:44,517] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/DS2/bin/python', '-u', 'cifar10_deepspeed.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json'] exits with return code = 1
Here is my environment info. Output of ds_report:
[2023-12-31 16:41:01,186] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/DS2/lib/python3.9/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/root/anaconda3/envs/DS2/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 125.87 GB
pip list
Package Version
------------------------ ----------
annotated-types 0.6.0
certifi 2023.11.17
charset-normalizer 3.3.2
contourpy 1.2.0
cycler 0.12.1
deepspeed 0.12.6
filelock 3.13.1
fonttools 4.47.0
fsspec 2023.12.2
hjson 3.1.0
idna 3.6
importlib-resources 6.1.1
Jinja2 3.1.2
kiwisolver 1.4.5
loguru 0.7.2
MarkupSafe 2.1.3
matplotlib 3.8.2
mpi4py 3.1.4
mpmath 1.3.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.2
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
packaging 23.2
Pillow 10.1.0
pip 23.3.1
psutil 5.9.7
py-cpuinfo 9.0.0
pydantic 2.5.3
pydantic_core 2.14.6
pynvml 11.5.0
pyparsing 3.1.1
python-dateutil 2.8.2
requests 2.31.0
setuptools 68.2.2
six 1.16.0
sympy 1.12
torch 2.1.2
torchvision 0.16.2
tqdm 4.66.1
triton 2.1.0
typing_extensions 4.9.0
urllib3 2.1.0
wheel 0.41.2
zipp 3.17.0
gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I am new to DeepSpeed. What should I do? Thank you so much!
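The first real failure in the build log is `cc1plus: fatal error: cuda_runtime.h: No such file or directory`, so one way to narrow this down might be to check whether that header exists in any of the include directories the compiler is given. The snippet below is only a diagnostic sketch, not a fix: `find_header` is a hypothetical helper written for this post, the environment path is copied from the log above, and reading `CUDA_HOME` assumes that variable points at the CUDA toolkit if it is set.

```python
import os
from pathlib import Path


def find_header(include_dirs, header="cuda_runtime.h"):
    """Return the directories from include_dirs that contain the given header file."""
    return [d for d in include_dirs if (Path(d) / header).is_file()]


if __name__ == "__main__":
    # Path taken from the '-isystem' flags in the build log above.
    candidates = ["/root/anaconda3/envs/DS2/include"]

    # Assumption: if CUDA_HOME is set, it points at a CUDA toolkit root.
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home:
        candidates.append(os.path.join(cuda_home, "include"))

    found = find_header(candidates)
    if found:
        print("cuda_runtime.h found in:", found)
    else:
        print("cuda_runtime.h not found in any of:", candidates)
```

If the header is not found in any of these directories, that would be consistent with the error above: nvcc itself is available in the conda environment, but the CUDA development headers are not where the build looks for them.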