[BUG] ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
Describe the bug
Running step 2 with the following script:
deepspeed DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py \
  --data_split 2,4,4 \
  --model_name_or_path facebook/opt-350m \
  --num_padding_at_beginning 1 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --max_seq_len 512 \
  --learning_rate 5e-5 \
  --weight_decay 0.1 \
  --num_train_epochs 1 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type cosine \
  --num_warmup_steps 0 \
  --gradient_checkpointing \
  --seed 1234 \
  --zero_stage 0 \
  --deepspeed \
  --output_dir /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output \
  &> /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output/rm_training.log
I then got the following errors:
CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
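A possible way to narrow this down, offered only as a sketch: the first error suggests that no `c++` compiler is on `PATH`, and an undefined `c10::cuda` symbol in a cached `.so` is often a sign that the extension was built against a different PyTorch build than the one currently installed. The helper below (the function name `diagnose` is made up for illustration) checks both conditions using only the standard library. It assumes the default extension cache location `~/.cache/torch_extensions`, which matches the path in the traceback when running as root.

```python
import shutil
from pathlib import Path


def diagnose(cache_root="~/.cache/torch_extensions"):
    """Collect two facts relevant to the errors above (hypothetical helper)."""
    # Mirrors the failing `which c++` call: None means no compiler on PATH.
    compiler = shutil.which("c++")

    # Look for previously built fused_adam directories, e.g.
    # <cache_root>/py310_cu117/fused_adam, which may be stale.
    cache = Path(cache_root).expanduser()
    builds = sorted(str(p) for p in cache.glob("*/fused_adam")) if cache.is_dir() else []

    return {"cxx_compiler": compiler, "cached_fused_adam_builds": builds}


if __name__ == "__main__":
    report = diagnose()
    if report["cxx_compiler"] is None:
        print("No `c++` found on PATH; the extension cannot be rebuilt.")
    else:
        print("C++ compiler:", report["cxx_compiler"])
    for build in report["cached_fused_adam_builds"]:
        print("Cached build (may be stale):", build)
```

If a cached build is listed and a compiler is available, removing that cached directory and rerunning should force the extension to be rebuilt against the installed PyTorch. Whether that resolves this particular failure is an assumption, not something confirmed here.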