xHansonx — 2 issues

Based on the results you showed in the gists, it looks like the training and validation losses are diverging instead of converging as the number of epochs increases. So, what's the problem...

### 🐛 Describe the bug

Code:

```
torchrun --standalone --nproc_per_node=1 train_reward_model.py --dataset Dahoas/rm-static --subset ../../../datasets/Dahoas_rm-static --max_len 512 --model gpt2 --pretrain ../../../gpt2/gpt2-small --lora_rank 0 --max_epochs 1 --batch_size 1 --loss_fn log_sig --test...
```
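For context on the `--loss_fn log_sig` flag above: in reward-model training on pairwise preference data such as Dahoas/rm-static, `log_sig` presumably refers to the pairwise log-sigmoid ranking loss, `-log(sigmoid(r_chosen - r_rejected))`. A minimal sketch, assuming that interpretation (the function name `log_sig_loss` here is hypothetical, not from the repository):

```python
import math

def log_sig_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise log-sigmoid ranking loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the model scores the chosen response further
    above the rejected one. A sketch of the assumed `log_sig` objective,
    not the repository's actual implementation.
    """
    diff = chosen_reward - rejected_reward
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch for numerical stability
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# A larger margin between chosen and rejected yields a smaller loss:
print(log_sig_loss(2.0, 0.0) < log_sig_loss(0.5, 0.0))  # True
```

If the training loss drops while the validation loss rises under this objective, the usual suspect is overfitting to the preference pairs rather than a bug in the loss itself.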

Label: bug