ShuoSIr7
> Try the latest main code please, we fixed this bug yesterday.

Hi, thanks for the reply. I just found you fixed the seed when `vllm_tensor_parallel_size > 1`, but this script...
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model $base_model \
    --dataset $train_data \
    --output_dir $out_dir \
    --num_generations 4 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --per_device_eval_batch_size 2 \
    ...
```
> For long sequences, maybe you can try sequence parallel.

OK, thanks.
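As a hedged sketch of that suggestion: ms-swift exposes sequence parallelism through a `--sequence_parallel_size` training argument, which splits each long sequence across several GPUs. The flag placement and the value `2` below are assumptions for illustration; check the docs of your installed version. The remaining arguments stay as in the script above:

```shell
# Sketch only: enable sequence parallelism so each long sequence is split
# across 2 GPUs. --sequence_parallel_size is taken from the ms-swift docs;
# verify the flag name and supported values for your installed version.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model $base_model \
    --dataset $train_data \
    --sequence_parallel_size 2 \
    ...
```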
> [@Jintao-Huang](https://github.com/Jintao-Huang) The same problem also occurs when training models on ultra-long texts; some training steps raise errors. The training command is as follows:
>
> ```shell
> deepspeed --hostfile=/etc/mpi/hostfile swift/cli/sft.py \
>     --model $PRETRAIN_MODEL \
>     --torch_dtype bfloat16 \
>     --train_type full \
>     --use_chat_template \
>     --dataset $data_path \
>     --packing true \
>     --num_train_epochs 3 \
>     --per_device_train_batch_size $per_node_bsz \
>     --data_seed 42 \
>     --weight_decay 0.1 \
>     --learning_rate 1e-5 \
>     --attn_impl flash_attn \
>     ...
> ```