[BUG] Running step 2 (reward model) fails with errors
Describe the bug
When the script automatically proceeds to step 2, it fails with this error:
exits with return code = -9
Traceback (most recent call last):
File "/home/kidd/projects/llms/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 210, in
[2023-04-16 08:08:00,497] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 364004
[2023-04-16 08:08:00,520] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 364005
[2023-04-16 08:08:01,554] [ERROR] [launch.py:434:sigkill_handler] ['/home/kidd/anaconda3/bin/python', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-350m', '--num_padding_at_beginning', '1', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '5e-5', '--weight_decay', '0.1', '--num_train_epochs', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/kidd/projects/llms/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m'] exits with return code = -9
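For context on the `-9`: when a child process is started through Python's `subprocess` module, a negative return code means the child was terminated by the signal with that number, so `-9` usually corresponds to SIGKILL. One common source of SIGKILL on Linux is the kernel's out-of-memory killer, although the logs above do not confirm that this is what happened here. The helper below (`describe_return_code`, a hypothetical name) is only a sketch of how to decode such a code:

```python
import signal


def describe_return_code(code: int) -> str:
    """Explain a return code as reported by Python's subprocess module."""
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with error status {code}"
    # On POSIX, a negative value means the child was killed by signal -code.
    try:
        name = signal.Signals(-code).name
    except ValueError:
        # Signal number not known on this platform.
        name = f"signal {-code}"
    return f"terminated by {name}"


print(describe_return_code(-9))
```

If the process was killed for memory reasons, `dmesg` or the system journal would normally show an out-of-memory entry for the killed PID.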
Here are the log files and the script: actor-training.log reward-training.log
run-13b.sh:

To Reproduce
Steps to reproduce the behavior:
1. Run python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
System info (please complete the following information):
- OS: [e.g. Ubuntu 22]
- GPU count and types: single node with 2x 3090
- (if applicable) what DeepSpeed-MII version are you using
- (if applicable) Hugging Face Transformers/Accelerate/etc. versions
- Python version
- Any other relevant info about your setup
@janglichao I see in the actor model output:
main.py: error: unrecognized arguments: --only_optimizer_lora
Could you please check that you have the latest DeepSpeed (>=0.9.0) and the latest changes from the master branch of the DeepSpeedExamples repo? Thanks
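A quick way to check the installed DeepSpeed version against the 0.9.0 minimum is sketched below. It reads the version from package metadata using only the standard library. The helper names (`parse_version`, `meets_minimum`, `check_deepspeed`) are illustrative and not part of DeepSpeed, and the comparison is deliberately simple: it looks only at the numeric major, minor, and patch components.

```python
import re
from importlib import metadata

MIN_DEEPSPEED = (0, 9, 0)  # minimum version mentioned above


def parse_version(text: str) -> tuple:
    """Turn a string like '0.9.1+abc123' into (0, 9, 1).

    Local suffixes after '+' are dropped and only the leading digits of each
    component are used, so pre-release tags such as 'rc1' are ignored.
    """
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple = MIN_DEEPSPEED) -> bool:
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= minimum


def check_deepspeed() -> str:
    """Report the installed DeepSpeed version, if any."""
    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return "deepspeed is not installed in this environment"
    status = "OK" if meets_minimum(installed) else "too old, upgrade needed"
    return f"deepspeed {installed}: {status}"


print(check_deepspeed())
```

This only covers the DeepSpeed package itself. The DeepSpeedExamples checkout is a git repository, so it still needs to be updated separately, for example with `git pull` on the master branch.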