Camille Zhong
Hello, @zhangyipin, thanks for your questions! This point is actually still under discussion, and the key question is how to define a step. (1) If we consider each inference...
hi @mikeda100, could you share the run command you used when you hit this error?
hi @mynamedaike, regarding the error "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels", this may happen when loading or tokenizing...
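One common workaround for this class of error (a minimal sketch, not the project's own fix): if a rank spends a long time loading or tokenizing data before its first collective call, the NCCL watchdog can fire, so you can raise the process-group timeout when initializing. The 2-hour value below is illustrative; `gloo` is used here only so the single-process sketch runs on CPU — in real multi-GPU training you would pass `backend="nccl"`.

```python
import datetime
import os

import torch.distributed as dist

# Single-process illustration: the key is the `timeout` argument, which
# bounds how long collectives may wait before the watchdog aborts them.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",          # use "nccl" on real multi-GPU runs
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(hours=2),  # illustrative value
)
assert dist.is_initialized()
dist.destroy_process_group()
```

If raising the timeout only hides the problem, it is usually better to move the slow preprocessing before `init_process_group`, or to pre-tokenize and cache the dataset so all ranks reach the first collective quickly.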
hi @kkangjiawei, could you please share more information about this problem, such as which kind of task you are running in pretrain.py?
hi @iMountTai, the two models can be different, as long as the actor is the same as the initial model (the one trained in the SFT stage) and the critic is the same...
hi @guijuzhejiang, since stage 3 of RLHF uses reinforcement learning (here, the PPO algorithm), its long runtime and instability may be caused by the dataset size and the dynamics of the training process....
> Here is my run script: torchrun --standalone --nproc_per_node=8 train_sft.py --pretrain $PRETRAIN --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path $SAVE_PATH --dataset $DATASET --batch_size 2 --accimulation_steps 16 --lr 2e-5 --max_datasets_size 512...
> meet the same problem, have you resolved it

@xienan0326, can you provide more information about this error? Actually, your exit code is different from theirs. :)