
[BUG] Running step 2 (reward model) fails with errors

Open janglichao opened this issue 2 years ago • 1 comments

Describe the bug
When the script automatically advances to step 2, it fails with the following error:

exits with return code = -9
Traceback (most recent call last):
  File "/home/kidd/projects/llms/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 210, in <module>
    main(args)
  File "/home/kidd/projects/llms/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 195, in main
    launch_cmd(args, step_num, cmd)
  File "/home/kidd/projects/llms/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 175, in launch_cmd
    raise RuntimeError('\n\n'.join((
RuntimeError: Step 2 exited with non-zero status 247

[2023-04-16 08:08:00,497] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 364004
[2023-04-16 08:08:00,520] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 364005
[2023-04-16 08:08:01,554] [ERROR] [launch.py:434:sigkill_handler] ['/home/kidd/anaconda3/bin/python', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-350m', '--num_padding_at_beginning', '1', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '5e-5', '--weight_decay', '0.1', '--num_train_epochs', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/kidd/projects/llms/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m'] exits with return code = -9
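For anyone decoding the two numbers in these logs, here is a minimal sketch, assuming a POSIX system, of how a return code of -9 maps to a signal and how -9 turns into an exit status of 247. The helper names `describe_return_code` and `as_exit_status` are hypothetical. That the launcher forwards the child's -9 as its own exit status is an inference from the arithmetic (-9 mod 256 = 247), not something the logs state. SIGKILL is often sent by the Linux out-of-memory killer, but nothing in the output above confirms that cause.

```python
import signal


def describe_return_code(returncode: int) -> str:
    """Describe a subprocess-style return code.

    Python's subprocess module reports a negative value -N when the child
    was terminated by signal N (POSIX only).
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {returncode}"


def as_exit_status(code: int) -> int:
    """Exit statuses are truncated to 8 bits, so -9 is reported as 247."""
    return code % 256


print(describe_return_code(-9))
print(as_exit_status(-9))
```

If the process really was killed for running out of host memory, the kernel log would usually say so; that check is outside the scope of this sketch.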

Here are the log files and the script: actor-training.log, reward-training.log

run-13b.sh: Screenshot from 2023-04-16 11-42-18

To Reproduce
Steps to reproduce the behavior:
1. Run python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node

System info (please complete the following information):

  • OS: [e.g. Ubuntu 22]
  • GPU count and types: single node with 2 x 3090
  • (if applicable) what DeepSpeed-MII version are you using
  • (if applicable) Hugging Face Transformers/Accelerate/etc. versions
  • Python version
  • Any other relevant info about your setup
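Most of the fields above were left as template placeholders. A minimal sketch of one way to gather them, assuming a standard Python environment: the function name `collect_system_info` and the package list are illustrative assumptions, not part of the DeepSpeed tooling.

```python
import platform
from importlib import metadata

# Assumption: these are the packages most relevant to this report.
PACKAGES = ("deepspeed", "torch", "transformers", "accelerate")


def collect_system_info(packages=PACKAGES) -> dict:
    """Return OS, Python, and installed package versions as a dict."""
    info = {
        "OS": platform.platform(),
        "Python": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            info[name] = "not installed"
    return info


for key, value in collect_system_info().items():
    print(f"{key}: {value}")
```

The printed lines can be pasted directly under "System info" when filing the issue.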

janglichao avatar Apr 16 '23 03:04 janglichao

@janglichao I see in the actor model output: main.py: error: unrecognized arguments: --only_optimizer_lora

Could you please check that you have the latest DeepSpeed (>=0.9.0) and the latest changes from the master branch of the DeepSpeedExamples repo? Thanks.

mrwyattii avatar Apr 17 '23 17:04 mrwyattii
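The version check suggested in the comment above can be sketched as follows. This is a minimal example, assuming the package is installed under the distribution name `deepspeed`; the helpers `parse_version` and `meets_minimum` are hypothetical and compare only the leading numeric components, ignoring pre-release and local suffixes. It does not verify that the DeepSpeedExamples checkout is up to date.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Parse leading numeric components, e.g. '0.9.1+cu117' -> (0, 9, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.9.0") -> bool:
    """Compare versions numerically so that 0.10.0 counts as newer than 0.9.0."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so '0.9' equals '0.9.0'
    b += (0,) * (width - len(b))
    return a >= b


try:
    installed = metadata.version("deepspeed")
    print(f"deepspeed {installed}: meets >=0.9.0? {meets_minimum(installed)}")
except metadata.PackageNotFoundError:
    print("deepspeed is not installed in this environment")
```

Comparing integer tuples rather than raw strings avoids the common mistake where "0.10.0" sorts before "0.9.0".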