Camille Zhong
Hello, @zhangyipin, thanks for your questions! This point is actually still under discussion, and the key question is how to define a step. (1) If we consider each inference...
hi @mikeda100, could you share the run command you used when you hit this error?
hi @mynamedaike, regarding the error "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels", this may happen when loading or tokenizing...
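One common workaround for this class of error (a minimal sketch, not the project's own fix): if a rank spends a long time loading or tokenizing data before its first collective call, the NCCL watchdog can fire, so you can raise the process-group timeout when initializing. The 2-hour value below is illustrative; `gloo` is used here only so the single-process sketch runs on CPU — in real multi-GPU training you would pass `backend="nccl"`.

```python
import datetime
import os

import torch.distributed as dist

# Single-process illustration: the key is the `timeout` argument, which
# bounds how long collectives may wait before the watchdog aborts them.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",          # use "nccl" on real multi-GPU runs
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(hours=2),  # illustrative value
)
assert dist.is_initialized()
dist.destroy_process_group()
```

If raising the timeout only hides the problem, it is usually better to move the slow preprocessing before `init_process_group`, or to pre-tokenize and cache the dataset so all ranks reach the first collective quickly.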
hi @kkangjiawei, could you please share more information about this problem, such as which kind of task you are running in pretrain.py?
hi @iMountTai, the two models can be different, as long as the actor is the same as the initial model (the one trained in the SFT stage) and the critic is the same...
hi @guijuzhejiang, since stage 3 of RLHF uses reinforcement learning (here, the PPO algorithm), its long runtime and instability may be caused by the dataset size and the dynamics of the training process....
> Here is my run script: torchrun --standalone --nproc_per_node=8 train_sft.py --pretrain $PRETRAIN --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path $SAVE_PATH --dataset $DATASET --batch_size 2 --accimulation_steps 16 --lr 2e-5 --max_datasets_size 512...
> meet the same problem, have you resolved it

@xienan0326, can you provide more information about this error? Actually, your exit code is different from theirs. :)