
[BUG] Failed to run the example in DeepSpeedExamples

Open 4t8dd opened this issue 2 years ago • 2 comments

I followed the example here: https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat#-quick-start-

But then I got an error when training the 1.3B model:

python3 train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
[2023-04-14 08:07:54,793] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 4342
[2023-04-14 08:07:56,313] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 4343
[2023-04-14 08:07:56,314] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/root/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1

So the error here is: RuntimeError: Error building extension 'fused_adam'. I cannot resolve this. Please help.

PS: I have two GPUs with 16 GB of memory each, so I think the resources are sufficient.
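(Editorial note: the fused_adam failure is typically an environment problem rather than a GPU-memory one; JIT builds tend to fail when the system nvcc does not match the CUDA version torch was compiled against. A minimal sketch of that kind of check, using a hypothetical helper rather than DeepSpeed's actual code:)

```python
# Illustrative check, not DeepSpeed's actual code: JIT-compiled ops such
# as fused_adam are built with the system nvcc, so its CUDA version
# should line up with the CUDA version torch was compiled against.
def cuda_majors_match(torch_cuda: str, nvcc_cuda: str) -> bool:
    torch_major = int(torch_cuda.split(".")[0])
    nvcc_major = int(nvcc_cuda.split(".")[0])
    return torch_major == nvcc_major

# Values as they might be read from torch.version.cuda and `nvcc --version`.
print(cuda_majors_match("11.7", "11.5"))  # True  (same major; often still builds)
print(cuda_majors_match("11.7", "10.2"))  # False (major mismatch; build will fail)
```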

4t8dd avatar Apr 14 '23 08:04 4t8dd

@wang700 could you please share the output of ds_report? Thanks

mrwyattii avatar Apr 14 '23 16:04 mrwyattii

> @wang700 could you please share the output of ds_report? Thanks

I have encountered the same problem on Windows WSL Ubuntu 22.04. Here is the output of ds_report:


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/anaconda3/envs/my_gpt/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/ubuntu/anaconda3/envs/my_gpt/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
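(Editorial note: one detail worth noticing in the report above is that the DeepSpeed wheel was compiled against torch 1.12 / CUDA 11.3, while torch 2.0.0+cu117 is installed. A small sketch of surfacing that discrepancy; the values are copied from the report and `split_version` is a hypothetical helper:)

```python
# Values copied from the ds_report above. The DeepSpeed wheel was compiled
# against torch 1.12 / CUDA 11.3, while torch 2.0.0+cu117 is installed;
# split_version is a hypothetical helper for comparing the two.
def split_version(version: str) -> tuple:
    # Drop any local build suffix, e.g. "2.0.0+cu117" -> "2.0.0", then
    # keep only major.minor for a coarse compatibility comparison.
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])

wheel_torch = "1.12"             # "deepspeed wheel compiled w. ...... torch 1.12"
installed_torch = "2.0.0+cu117"  # "torch version .................... 2.0.0+cu117"

print(split_version(wheel_torch) == split_version(installed_torch))  # False
```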

HE1092021037 avatar Apr 15 '23 01:04 HE1092021037

OK. sure.

[Screenshot: 2023-04-17 at 9:19:30 AM]

4t8dd avatar Apr 17 '23 01:04 4t8dd

Resolved by installing the official CUDA Toolkit. I had previously installed a different version manually.
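(Editorial note: the fix above amounts to making sure the nvcc on PATH comes from the official toolkit and reports a release compatible with torch's CUDA build. A minimal sketch of pulling the release number out of `nvcc --version` output; the helper and sample text are illustrative, not captured from the poster's machine:)

```python
import re

# Hypothetical helper: extract the release number from `nvcc --version`
# output so it can be compared with torch.version.cuda.
def nvcc_release(version_text: str) -> str:
    match = re.search(r"release (\d+\.\d+)", version_text)
    if match is None:
        raise ValueError("no CUDA release number found in nvcc output")
    return match.group(1)

# Illustrative sample of the last line `nvcc --version` prints.
sample = "Cuda compilation tools, release 11.5, V11.5.119"
print(nvcc_release(sample))  # 11.5
```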

4t8dd avatar Apr 17 '23 05:04 4t8dd

Hello! Thanks for the info. I am having similar issues - would you mind shedding some light?


{
 "date": "2023-06-16T19:20:02-0600",
 "dirty": false,
 "error": null,
 "full-revisionid": "db4f43
/users/lisali12/.conda/envs/XXX/lib/python3.10/site-packages/pydantic/_internal/_config.py:261: UserWarning: Valid config keys have changed in V2:
* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
* 'validate_all' has been renamed to 'validate_default'
  warnings.warn(message, UserWarning)
/users/lisali12/.conda/envs/XXX/lib/python3.10/site-packages/pydantic/_internal/_fields.py:126: UserWarning: Field "model_persistence_threshold" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
/users/lisali12/.conda/envs/XXX/lib/python3.10/site-packages/pydantic/_internal/_config.py:261: UserWarning: Valid config keys have changed in V2:
* 'validate_all' has been renamed to 'validate_default'
  warnings.warn(message, UserWarning)

{
 "date": "2023-02-14T19:25:14-0800",
 "dirty": false,
 "error": null,
 "full-revisionid": "43a780
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja

{
 "date": "2022-11-05T08:23:29+0100",
 "dirty": false,
 "error": null,
 "full-revisionid": "c5a6e1
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/users/lisali12/.conda/envs/XXX/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed install path ........... ['/users/lisali12/.conda/envs/XXX/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.7.4+fe5ddd3, fe5ddd3, main
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
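(Editorial note: the report above may show the same toolkit mismatch the original poster fixed: torch reports CUDA 11.7 while nvcc reports 11.2. A sketch of flagging that directly from the ds_report text; the two lines are copied verbatim from above:)

```python
import re

# Lines copied verbatim from the ds_report above; a torch CUDA build of
# 11.7 next to a system nvcc of 11.2 is the same pattern the original
# poster resolved by installing the official CUDA Toolkit.
report = """\
torch cuda version ............... 11.7
nvcc version ..................... 11.2
"""

versions = dict(re.findall(r"(torch cuda version|nvcc version) \.+ ([\d.]+)", report))
print(versions["torch cuda version"] != versions["nvcc version"])  # True (mismatch)
```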

lisali12 avatar Jul 20 '23 15:07 lisali12