[BUG] KeyError: 'attention_mask'
Run step 3 with:

deepspeed --master_port 12346 DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
   --data_path wangrui6/Zhihu-KOL \
   --data_split 2,4,4 \
   --actor_model_name_or_path /home/kidd/projects/llms/pretrain_models/ChatGLM-6B/ \
   --critic_model_name_or_path /home/kidd/projects/llms/path_to_rm_checkpoint/ \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 4 \
   --per_device_mini_train_batch_size 4 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate 9.65e-6 \
   --critic_learning_rate 5e-6 \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --actor_zero_stage 2 \
   --critic_zero_stage 2 \
   --output_dir /home/kidd/projects/llms/ChatGLM-Efficient-Tuning/examples/ppo_model/ \
   &> /home/kidd/projects/llms/ChatGLM-Efficient-Tuning/examples/ppo_model/training.log
Then I got the following error:
DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py:521
... (intermediate frames elided; the failing frame is transformers' BatchEncoding.__getitem__) ...
  236 │        """
  237 │        if isinstance(item, str):
❱ 238 │            return self.data[item]
  239 │        elif self._encodings is not None:
  240 │            return self._encodings[item]
  241 │        else:
KeyError: 'attention_mask'
[2023-05-01 18:32:09,958] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1241775
[2023-05-01 18:32:09,958] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1241776
[2023-05-01 18:32:10,014] [ERROR] [launch.py:434:sigkill_handler] ['/home/kidd/anaconda3/bin/python', '-u', 'DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py', '--local_rank=1', '--data_path', 'wangrui6/Zhihu-KOL', '--data_split', '2,4,4', '--actor_model_name_or_path', '/home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/pretrain_models/ChatGLM-6B/', '--critic_model_name_or_path', '/examples/path_to_rm_checkpoint/', '--num_padding_at_beginning', '1', '--per_device_train_batch_size', '4', '--per_device_mini_train_batch_size', '4', '--generation_batch_numbers', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '256', '--max_prompt_seq_len', '256', '--actor_learning_rate', '9.65e-6', '--critic_learning_rate', '5e-6', '--actor_weight_decay', '0.1', '--critic_weight_decay', '0.1', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '1', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--enable_hybrid_engine', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--actor_zero_stage', '2', '--critic_zero_stage', '2', '--output_dir', '/examples/ppo_model/'] exits with return code = 1
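For context, the code at lines 236-241 of the traceback is transformers' BatchEncoding.__getitem__: the lookup self.data[item] raises KeyError because the tokenizer never put an "attention_mask" entry into the batch. A minimal sketch of the suspected cause, assuming (unverified) that the custom ChatGLM-6B tokenizer returns only input_ids:

from transformers import AutoTokenizer

# Assumption (unverified): the custom ChatGLM-6B tokenizer loaded with
# trust_remote_code=True does not include "attention_mask" in its
# model_input_names, so the BatchEncoding it returns carries only
# "input_ids".
tokenizer = AutoTokenizer.from_pretrained(
    "/home/kidd/projects/llms/pretrain_models/ChatGLM-6B/",
    trust_remote_code=True,
)

batch = tokenizer("测试输入", return_tensors="pt")
print(list(batch.keys()))       # if the assumption holds: ['input_ids'] only
mask = batch["attention_mask"]  # -> KeyError: 'attention_mask', as above

If that assumption holds, any code that indexes the batch with "attention_mask" (as main.py:521 apparently does) will fail the same way.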
Hi @janglichao, can you please provide more information about your setup?
ds_report output: please run ds_report to give us details about your setup.
System info (please complete the following information):
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. two machines with x8 A100s each] (if applicable)
- Hugging Face Transformers/Accelerate/etc. versions
- Python version
- Any other relevant info about your setup
ds_report:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/kidd/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/kidd/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.3+4d269c6e, 4d269c6e, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
OS: Ubuntu 22
GPU: 2x RTX 3090 (24 GB)
Python: 3.8
@janglichao I ran into this problem as well. Have you solved it? If so, how?
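For anyone else hitting this: one possible direction, as an untested sketch (it assumes the tokenizer exposes a usable pad_token_id and that the downstream code only needs a standard 0/1 padding mask), is to synthesize the missing mask right after tokenization:

def ensure_attention_mask(batch, tokenizer):
    # Untested sketch: derive a 0/1 attention mask from padding positions
    # when the tokenizer did not return one. "batch" is assumed to hold
    # torch tensors (return_tensors="pt"); ensure_attention_mask is a
    # hypothetical helper, not part of DeepSpeed-Chat.
    if "attention_mask" not in batch:
        batch["attention_mask"] = (
            batch["input_ids"] != tokenizer.pad_token_id
        ).long()
    return batch

This does not address why the ChatGLM tokenizer omits the mask in the first place, so treat it as a stopgap rather than a confirmed fix.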