fused_adam.so: cannot open shared object file: No such file or directory

Open weifj0212 opened this issue 3 years ago • 11 comments

train code: deepspeed training/trainer.py --per-device-train-batch-size 2 --per-device-eval-batch-size 2 --input-model EleutherAI/pythia-6.9b --local-output-dir output --deepspeed config/ds_z3_bf16_config.json --warmup-steps 0

ImportError: /home/xxxxx/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
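
The path in this error points into PyTorch's cache for JIT-compiled extensions, where each op is expected at `<op>/<op>.so`. Below is a minimal sketch that lists op directories whose compiled library is missing. It assumes the default Linux cache location `~/.cache/torch_extensions`, or whatever `TORCH_EXTENSIONS_DIR` is set to. The function names are illustrative and not part of DeepSpeed or PyTorch.

```python
import os
from pathlib import Path


def extensions_root() -> Path:
    """Directory where PyTorch caches JIT-built extensions (assumed default)."""
    # TORCH_EXTENSIONS_DIR, when set, overrides ~/.cache/torch_extensions.
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    return Path(override) if override else Path.home() / ".cache" / "torch_extensions"


def incomplete_builds(root: Path) -> list[Path]:
    """Return <root>/<py_cuda_tag>/<op> directories that lack <op>.so."""
    missing = []
    for op_dir in sorted(root.glob("*/*")):
        if op_dir.is_dir() and not (op_dir / f"{op_dir.name}.so").exists():
            missing.append(op_dir)
    return missing
```

Calling `incomplete_builds(extensions_root())` would list a directory such as `py310_cu117/fused_adam` if the build never produced a library. Deleting such a directory forces a fresh JIT build on the next run, but that only helps if the build itself can succeed, so it is worth scrolling up for the first compiler or linker error.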

weifj0212 avatar Apr 21 '23 11:04 weifj0212

There is probably an earlier error about shared libraries not being available. Do you have all the additional CUDA libraries installed, like cublas? You can see some package install lines in the training script that show what you need to have installed for your CUDA version

srowen avatar Apr 21 '23 15:04 srowen

I am also hitting this issue.

I have installed nvidia-cublas and nvidia-pyindex.

I think the issue is actually with my deepspeed installation. If I run:

ds_report

I get:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

Which seems to suggest that fused_adam is not installed: fused_adam ............. [NO] ....... [OKAY]
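
Note that, per the NOTE at the top of the report, an op marked [NO] under "installed" is not necessarily broken: it is meant to be JIT-compiled at runtime as long as "compatible" shows [OKAY]. A minimal sketch for pulling that distinction out of pasted ds_report text follows; the helper names are hypothetical, and the row format is assumed to match the reports in this thread.

```python
import re

# Matches op rows such as "fused_adam ............. [NO] ....... [OKAY]".
OP_ROW = re.compile(r"^\s*(\w+)\s*\.+\s*\[(YES|NO)\]\s*\.+\s*\[(OKAY|NO)\]\s*$")


def parse_op_report(report: str) -> dict[str, tuple[bool, bool]]:
    """Map op name -> (installed, compatible) from ds_report text."""
    ops = {}
    for line in report.splitlines():
        match = OP_ROW.match(line)
        if match:
            name, installed, compatible = match.groups()
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


def jit_only_ops(report: str) -> list[str]:
    """Ops that are not prebuilt but are reported as JIT-compatible."""
    return sorted(
        name
        for name, (installed, compatible) in parse_op_report(report).items()
        if not installed and compatible
    )


sample = """\
fused_adam ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [NO]
utils .................. [YES] ...... [OKAY]
"""
print(jit_only_ops(sample))  # ['fused_adam']
```

Ops in that list depend on a working compiler toolchain at runtime, which is where a missing `.so` would come from if the JIT build fails.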

Looking at DeepSpeed's documentation, you should be able to install fused_adam by running: DS_BUILD_FUSED_ADAM=1 pip install deepspeed

However, this doesn't seem to work at all. After running:

pip uninstall deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed
ds_report

I still see: fused_adam ............. [NO] ....... [OKAY]

At a bit of a loss tbh, any help would be greatly appreciated.

Samuel1989 avatar Apr 22 '23 22:04 Samuel1989

Try torch 1.13.1? Not sure if 2.0 is causing issues.

srowen avatar Apr 22 '23 22:04 srowen

I checked out the deepspeed repository and ran: DS_BUILD_FUSED_ADAM=1 pip3 install .

Then fused_adam ............. [YES] ...... [OKAY]

deepspeed seems to be running OK for me now.

Samuel1989 avatar Apr 22 '23 23:04 Samuel1989

deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

I have installed those libraries but still get the same errors. Please help!

libcublas 11.10.3.66 0 nvidia/label/cuda-11.7.1
libcublas-dev 11.10.3.66 0 nvidia/label/cuda-11.7.1
libcurand 10.2.10.91 0 nvidia/label/cuda-11.7.1
libcurand-dev 10.2.10.91 0 nvidia/label/cuda-11.7.1
libcusolver 11.4.0.1 0 nvidia/label/cuda-11.7.1
libcusolver-dev 11.4.0.1 0 nvidia/label/cuda-11.7.1
libcusparse 11.7.4.91 0 nvidia/label/cuda-11.7.1
libcusparse-dev 11.7.4.91 0 nvidia/label/cuda-11.7.1

weifj0212 avatar Apr 23 '23 07:04 weifj0212

@weifj0212 What is the result of running 'ds_report' in a terminal? Are you able to post the results?

Samuel1989 avatar Apr 23 '23 07:04 Samuel1989


--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/xxx/anaconda3/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/xxx/anaconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

weifj0212 avatar Apr 23 '23 08:04 weifj0212

@weifj0212 it looks like fused_adam isn't installed on your machine: fused_adam ............. [NO] ....... [OKAY]

I loosely followed: https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops

To install it, the instructions suggest that running the following should install fused_adam:

pip uninstall deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed

However, this didn't work for me; I had to check out their repo and install it manually (not sure why):

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_FUSED_ADAM=1 pip3 install .

Once it finished I ran ds_report which then looked like:

fused_adam ............. [YES] ...... [OKAY]

I am a complete novice at all of this so I really don't know if the same will work for you but it might be worth a try?

Samuel1989 avatar Apr 23 '23 09:04 Samuel1989

Thank you very much! Now I see fused_adam ............. [YES] ...... [OKAY], but I have a new error: /home/xxx/.cache/torch_extensions/py310_cu117/utils/utils.so: cannot open shared object file: No such file or directory.

weifj0212 avatar Apr 23 '23 09:04 weifj0212

I think you don't have all the right CUDA libs installed somehow, hard to say.

srowen avatar Apr 23 '23 13:04 srowen

@weifj0212 in the ds_report you posted there is also a "utils" component - I didn't have to install this so I can't say with any certainty but it might be worth installing deepspeed utils too?

DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed

Samuel1989 avatar Apr 23 '23 17:04 Samuel1989

Thanks very much! I changed from CUDA 11.7 to CUDA 11.6 and ran "DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed", and now there are no errors.

weifj0212 avatar Apr 24 '23 01:04 weifj0212

Nice. CUDA 11.7 works too (that's what I use), but I suspect something else in the shared libraries wasn't compatible here. It's tricky.

srowen avatar Apr 24 '23 01:04 srowen

@srowen Hello, thank you for your comment. Did you mean 11.7 didn't work for multi-node training?

cateto avatar Apr 24 '23 07:04 cateto

No, I mean it does work.

srowen avatar Apr 24 '23 12:04 srowen

@srowen okay! thanks

cateto avatar Apr 24 '23 23:04 cateto

I tried this on CUDA 11.8 with torch 2.0.0 but still get the same error: DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed

Any help?

Error: /home/xxx/.cache/torch_extensions/py310_cu118/utils/utils.so: cannot open shared object file: No such file or directory.

DS_REPORT

--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8

anupam-dewan avatar Jun 10 '23 17:06 anupam-dewan

It tells you the problem: [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0

srowen avatar Jun 10 '23 17:06 srowen

Is that not just a warning?

anupam-dewan avatar Jun 10 '23 19:06 anupam-dewan

I think it actually doesn't work here. You're also using deepspeed 0.9.2, and I know we had problems with >= 0.9.0. It could be that or any other differences in your env from what we show in this repo.

srowen avatar Jun 10 '23 19:06 srowen

@anupam-dewan I have the same problem. Have you solved it?

niuhuluzhihao avatar Jun 19 '23 16:06 niuhuluzhihao

I think there are lots of answers here. You haven't said what you are doing or what you tried.

srowen avatar Jun 19 '23 18:06 srowen

@Samuel1989 @srowen I ran "cd DeepSpeed" and then "DS_BUILD_FUSED_ADAM=1 pip3 install ." and got this error:

  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
      return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    File "/usr/lib/python3.10/subprocess.py", line 524, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['which', 'x86_64-linux-gnu-g++']' returned non-zero exit status 1.
  [end of output]
   note: This error originates from a subprocess, and is likely not a problem with pip.
   ERROR: Failed building wheel for deepspeed
   Running setup.py clean for deepspeed
   Failed to build deepspeed
   ERROR: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects
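
The traceback above shows that `which x86_64-linux-gnu-g++` returned a non-zero status, which means no executable of that name was found on PATH. A minimal sketch for checking the build tools before retrying follows. The tool list is an assumption for illustration: only `x86_64-linux-gnu-g++`, `nvcc` and `ninja` appear in this thread, and `g++` is added as a guess at the usual compiler name.

```python
import shutil


def missing_tools(
    tools: tuple[str, ...] = ("x86_64-linux-gnu-g++", "g++", "nvcc", "ninja"),
) -> list[str]:
    """Return the tool names that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"not on PATH: {tool}")
```

If the compiler turns out to be missing, installing the distribution's C++ toolchain is the likely fix (on Debian or Ubuntu this is usually the `build-essential` package, though that is an assumption about the poster's system).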

However, when I run 'DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed', the error above does not occur. But when I run my program, it still fails with errors like "utils/utils.so: cannot open shared object file: No such file or directory".

I also ran "ds_report". The output is:

[2023-06-20 11:35:42,863] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.4, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

This problem has troubled me for a long time, can you help me? Thank you
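
One thing that stands out in the report above is that the versions do not line up: torch is 1.13.1 while the wheel was compiled with torch 2.0, and nvcc is 11.8 while torch's CUDA is 11.7. Whether that is the cause here is not confirmed in this thread, and small CUDA differences may be tolerated, but it is cheap to check. A minimal sketch, with hypothetical helper names, that flags such pairs from pasted ds_report text:

```python
import re


def env_info(report: str) -> dict[str, str]:
    """Collect 'name ..... value' rows from ds_report output."""
    info = {}
    for line in report.splitlines():
        match = re.match(r"^(.+?)\s*\.{3,}\s*(.+)$", line)
        if match:
            info[match.group(1).strip()] = match.group(2).strip()
    return info


def major_minor(version: str) -> str:
    """Reduce '1.13.1+cu117' to '1.13'; return the input if it has no such prefix."""
    match = re.match(r"(\d+)\.(\d+)", version)
    return ".".join(match.groups()) if match else version


def version_mismatches(report: str) -> list[str]:
    """Describe version pairs that disagree at the major.minor level."""
    info = env_info(report)
    pairs = []
    built = re.match(r"torch\s+([\d.]+),\s*cuda\s+([\d.]+)",
                     info.get("deepspeed wheel compiled w.", ""))
    if built:
        pairs.append(("torch version", info.get("torch version"),
                      "wheel torch", built.group(1)))
        pairs.append(("torch cuda version", info.get("torch cuda version"),
                      "wheel cuda", built.group(2)))
    pairs.append(("nvcc version", info.get("nvcc version"),
                  "torch cuda version", info.get("torch cuda version")))
    # Only compare pairs where both values were actually present in the report.
    return [f"{a_name} {a} != {b_name} {b}"
            for a_name, a, b_name, b in pairs
            if a and b and major_minor(a) != major_minor(b)]
```

Run on the environment section above, this would flag the torch 1.13.1 vs. 2.0 pair and the nvcc 11.8 vs. CUDA 11.7 pair. Reinstalling deepspeed inside the same environment as the torch you actually use would be one way to bring them back in line.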

niuhuluzhihao avatar Jun 20 '23 03:06 niuhuluzhihao

I have the same problem as you. Have you solved it yet?

Wiselnn570 avatar Nov 27 '23 13:11 Wiselnn570

Me too, have you solved it yet?

Rainbowman0 avatar Jan 02 '24 02:01 Rainbowman0

Please consider trying this:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .

If you encounter a mismatch between cuda and gcc, consider lowering the gcc version and running it again. I hope this helps.

iwannabewater avatar Jul 16 '24 05:07 iwannabewater