Camille Zhong

20 comments by Camille Zhong

Hello @zhangyipin, thanks for your questions! This issue is actually still debated, and the key point is how to define a step. (1) If we consider each inference...

Hi @mynamedaike, regarding the error "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels", this may happen when you load or tokenize...

Hi @kkangjiawei, could you please share more information about this problem, such as which kind of task you are running in pretrain.py?

Hi @iMountTai, the two models can be different as long as the actor is the same as the initial model (the one trained in the SFT stage), and the critic is the same...

Hi @guijuzhejiang, since stage 3 of RLHF uses reinforcement learning (here we use the PPO algorithm), its long running time and instability may be caused by the dataset size and the dynamic training progress....

> Here is my run script: torchrun --standalone --nproc_per_node=8 train_sft.py --pretrain $PRETRAIN --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path $SAVE_PATH --dataset $DATASET --batch_size 2 --accimulation_steps 16 --lr 2e-5 --max_datasets_size 512...

> meet the same problem, have you resolved it ![image](https://user-images.githubusercontent.com/38122102/232370707-57607d6d-3b83-4715-bfce-8f4c3dc00a73.png)

@xienan0326 can you provide more information about this error? Your exit code is actually different from theirs. :)
