fused_adam.so: cannot open shared object file: No such file or directory
Training command: deepspeed training/trainer.py --per-device-train-batch-size 2 --per-device-eval-batch-size 2 --input-model EleutherAI/pythia-6.9b --local-output-dir output --deepspeed config/ds_z3_bf16_config.json --warmup-steps 0
ImportError: /home/xxxxx/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
There is probably an earlier error about shared libraries not being available. Do you have all the additional CUDA libraries installed, like cublas? The package install lines in the training script show what you need to have installed for your CUDA version.
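For anyone debugging this: the path in the ImportError is where the JIT build would have placed the compiled op, so a quick first check is whether a .so was ever produced there. Below is a rough shell sketch, not an official DeepSpeed tool. The function name list_missing_ops is made up for illustration, and the default directory is simply copied from the error above (yours may differ, for example if TORCH_EXTENSIONS_DIR is set).

```shell
# List JIT op build directories that do not contain the expected <op>/<op>.so.
# Assumption: the default path is taken from the error message in this thread.
list_missing_ops() {
    ext_dir="${1:-$HOME/.cache/torch_extensions/py310_cu117}"
    for op_dir in "$ext_dir"/*/; do
        [ -d "$op_dir" ] || continue        # glob matched nothing
        op="$(basename "$op_dir")"
        if [ ! -f "${op_dir}${op}.so" ]; then
            echo "$op"                      # build directory exists, but no .so inside
        fi
    done
}

# Usage:
#   list_missing_ops                      # check the default cache path
#   list_missing_ops /path/to/extensions  # or pass another directory
```

If an op shows up here, its JIT build most likely failed earlier in the log. Deleting that op's directory and rerunning should trigger a fresh build attempt, which at least surfaces the real compiler error.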
I am also hitting this issue.
I have installed nvidia-cublas and nvidia-pyindex.
I think the issue is actually with my DeepSpeed installation. If I run:
ds_report
I get:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
Which seems to suggest that fused_adam is not installed:
fused_adam ............. [NO] ....... [OKAY]
According to DeepSpeed's documentation, you should be able to install fused_adam by running:
DS_BUILD_FUSED_ADAM=1 pip install deepspeed
However, this doesn't seem to work. After running:
pip uninstall deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed
ds_report
I still see:
fused_adam ............. [NO] ....... [OKAY]
At a bit of a loss tbh, any help would be greatly appreciated.
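A small sketch for pulling the not-yet-built ops out of the report, assuming the column layout shown above. ops_not_installed is a hypothetical helper name, not part of DeepSpeed. Keep in mind that, per the NOTE at the top of the report, [NO] in the installed column only means the op was not pre-built; it should still be JIT compiled at runtime if it is marked compatible.

```shell
# Print the names of ops whose "installed" column is [NO].
# Reads ds_report output on stdin and matches rows such as:
#   fused_adam ............. [NO] ....... [OKAY]
ops_not_installed() {
    awk '/^[a-z_]+ \.+ *\[NO\]/ { print $1 }'
}

# Usage (requires DeepSpeed to be installed):
#   ds_report 2>/dev/null | ops_not_installed
```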
Try torch 1.13.1? I'm not sure whether 2.0 is causing issues.
I checked out the deepspeed repository and ran:
DS_BUILD_FUSED_ADAM=1 pip3 install .
Then ds_report showed:
fused_adam ............. [YES] ...... [OKAY]
deepspeed seems to be running OK for me now.
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
I have installed those libraries, but I still get the same errors. Please help!
libcublas        11.10.3.66   0   nvidia/label/cuda-11.7.1
libcublas-dev    11.10.3.66   0   nvidia/label/cuda-11.7.1
libcurand        10.2.10.91   0   nvidia/label/cuda-11.7.1
libcurand-dev    10.2.10.91   0   nvidia/label/cuda-11.7.1
libcusolver      11.4.0.1     0   nvidia/label/cuda-11.7.1
libcusolver-dev  11.4.0.1     0   nvidia/label/cuda-11.7.1
libcusparse      11.7.4.91    0   nvidia/label/cuda-11.7.1
libcusparse-dev  11.7.4.91    0   nvidia/label/cuda-11.7.1
@weifj0212 What is the result of running 'ds_report' in a terminal? Are you able to post the results?
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/xxx/anaconda3/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/xxx/anaconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
@weifj0212 it looks like fused_adam isn't installed on your machine:
fused_adam ............. [NO] ....... [OKAY]
I loosely followed: https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops
To install it, the instructions suggest that running the following should install fused_adam:
pip uninstall deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed
However, this didn't work for me; I had to check out their repo and install it manually (not sure why):
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_FUSED_ADAM=1 pip3 install .
Once it finished I ran ds_report which then looked like:
fused_adam ............. [YES] ...... [OKAY]
I am a complete novice at all of this so I really don't know if the same will work for you but it might be worth a try?
Thank you very much! Now I see:
fused_adam ............. [YES] ...... [OKAY]
but I have a new error:
/home/xxx/.cache/torch_extensions/py310_cu117/utils/utils.so: cannot open shared object file: No such file or directory.
I think you don't have all the right CUDA libs installed somehow, hard to say.
@weifj0212 in the ds_report you posted there is also a "utils" component - I didn't have to install this so I can't say with any certainty but it might be worth installing deepspeed utils too?
DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed
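The flags in this thread follow a DS_BUILD_<OP>=1 pattern, so here is a sketch that builds the prefix from op names. ds_build_flags is a made-up helper, and the upper-casing rule is an assumption based only on the two flags used above (DS_BUILD_FUSED_ADAM, DS_BUILD_UTILS). Some ops may use a different flag name, so check the advanced-install page linked earlier before relying on it.

```shell
# Build a "DS_BUILD_<OP>=1 ..." string from op names as they appear in ds_report.
# Assumption: the flag name is the upper-cased op name, which matches the two
# flags used in this thread but is not guaranteed for every op.
ds_build_flags() {
    flags=""
    for op in "$@"; do
        name="$(printf '%s' "$op" | tr '[:lower:]' '[:upper:]')"
        flags="${flags}${flags:+ }DS_BUILD_${name}=1"
    done
    printf '%s\n' "$flags"
}

# Usage, from inside a DeepSpeed source checkout:
#   env $(ds_build_flags utils fused_adam) pip install .
```

If pip install deepspeed with these flags still shows [NO], one thing that might be worth ruling out is pip reusing a previously built wheel from its cache; pip's --no-cache-dir option avoids that. That is a guess on my part, nobody in this thread has confirmed it.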
Thanks very much! I changed CUDA 11.7 to CUDA 11.6 and ran "DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed", and there are no errors now.
Nice. CUDA 11.7 works too (that's what I use), but I suspect something else in the shared libraries wasn't compatible here. It's tricky.
@srowen Hello, thank you for your comment. Did you mean 11.7 didn't work for multi-node training?
No, I mean it does work.
@srowen okay! thanks
I tried this on CUDA 11.8 with torch 2.0.0 but still get the same error:
DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed
Any help?
Error: /home/xxx/.cache/torch_extensions/py310_cu118/utils/utils.so: cannot open shared object file: No such file or directory.
ds_report output:
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
It tells you the problem:
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
Is that not just a warning?
I think it actually doesn't work here. You're also using deepspeed 0.9.2, and I know we had problems with >= 0.9.0. It could be that or any other differences in your env from what we show in this repo.
@anupam-dewan I have the same problem. Have you solved it?
I think there are lots of answers here. You haven't said what you are doing or what you tried.
@Samuel1989 @srowen I ran "cd DeepSpeed" and then "DS_BUILD_FUSED_ADAM=1 pip3 install ." and got an error like this:
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'x86_64-linux-gnu-g++']' returned non-zero exit status 1.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for deepspeed
Running setup.py clean for deepspeed
Failed to build deepspeed
ERROR: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects
However, when I run "DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed" I don't get the error above. But when I run my program, I still get errors like "utils/utils.so: cannot open shared object file: No such file or directory".
When I run ds_report, the output is:
[2023-06-20 11:35:42,863] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.4, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
This problem has troubled me for a long time. Can you help me? Thank you.
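Looking at the environment section of this report, two things stand out: torch is 1.13.1 while the wheel says it was compiled with torch 2.0, and torch's CUDA is 11.7 while nvcc is 11.8. I can't say for sure that either one is the cause (an earlier report in this thread shows nvcc 11.5 next to torch CUDA 11.7), but it is cheap to check. A rough sketch, with hypothetical helper names report_field and check_cuda_match, assuming the "key ..... value" layout of ds_report shown above:

```shell
# Print the value of one "key ....... value" line from ds_report output on stdin.
report_field() {
    awk -v key="$1" 'index($0, key " .") == 1 { sub(/^.*\.\.\. /, ""); print; exit }'
}

# Compare torch's CUDA version with the nvcc version in a saved report file.
check_cuda_match() {
    torch_cuda="$(report_field 'torch cuda version' < "$1")"
    nvcc_ver="$(report_field 'nvcc version' < "$1")"
    if [ "$torch_cuda" = "$nvcc_ver" ]; then
        echo "ok: torch CUDA and nvcc are both $nvcc_ver"
    else
        echo "mismatch: torch CUDA $torch_cuda, nvcc $nvcc_ver"
    fi
}

# Usage (assumes DeepSpeed is installed):
#   ds_report > report.txt 2>&1
#   check_cuda_match report.txt
#   report_field 'torch version' < report.txt
#   report_field 'deepspeed wheel compiled w.' < report.txt
```

If the torch version and the "wheel compiled w." line disagree, reinstalling DeepSpeed in the same environment as the torch you actually run with seems like a reasonable next step, though I have not verified that it fixes this particular error.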
I have the same problem as you. Have you solved it yet?
Me too. Have you solved it yet?
Please consider trying this:
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .
If you encounter a mismatch between CUDA and gcc, consider lowering the gcc version and running it again. I hope this helps.
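Since building from source needs a working compiler toolchain (the "which x86_64-linux-gnu-g++" failure earlier in this thread is what it looks like when one is missing), it may help to check that the tools are on PATH before running pip. A minimal sketch; check_build_tools is a hypothetical helper, and the tool list in the usage comment only uses names that appear in this thread:

```shell
# Report which of the given build tools are missing from PATH.
# Prints one "missing: <tool>" line per absent tool and returns
# non-zero if at least one tool is missing.
check_build_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_build_tools x86_64-linux-gnu-g++ nvcc ninja && \
#       DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .
```

On Debian/Ubuntu the g++ binary usually comes from the g++ or build-essential package, but that depends on your system.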