Huizhuo Angela Yuan

24 comments by Huizhuo Angela Yuan

> In the paper, it seems that you combined the two steps of reinforcement learning into one step, forming an end-to-end training method. The specific algorithm is shown in the...

First of all, the algorithm requires a monotonically decreasing, convex function $\ell$. Note that $\ell(0) = \log(2) \approx 0.6931$. Therefore, the value of this formula is...
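As a quick sanity check on the $\log(2)$ value, here is a minimal sketch that assumes $\ell$ is the logistic loss $\ell(t) = \log(1 + e^{-t})$. That choice is an assumption: it is decreasing and convex and gives $\ell(0) = \log(2)$, but the comment above does not name the function. The helper name `logistic_loss` is hypothetical.

```python
import math


def logistic_loss(t: float) -> float:
    """Logistic loss log(1 + exp(-t)), assumed here as the choice of ell.

    Written as max(-t, 0) + log1p(exp(-|t|)) so that large |t| does not overflow.
    """
    return max(-t, 0.0) + math.log1p(math.exp(-abs(t)))


# At t = 0 the loss equals log(2), matching the value quoted above.
print(logistic_loss(0.0), math.log(2))
```

Under this assumption, a loss near 0.6931 corresponds to an argument of zero inside $\ell$.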

> Hi there, great job on the project!
>
> I'm looking to clarify whether the UCLA-AGI/zephyr-7b-sft-full-SPIN-iter1 model was fine-tuned on top of UCLA-AGI/zephyr-7b-sft-full-SPIN-iter0 or alignment-handbook/zephyr-7b-sft-full. The paper suggests that...

Hi @junkangwu, thanks for your follow-up question. In all iterations, the num_train_epochs parameter is set to 6. This setting is enforced explicitly in the training script. For instance, in...

Yes, and we have updated the scripts and guidelines to reproduce the results in our paper. We specified the version by adding the model_revision parameter in config.yaml. Please check our recent...

> If I want to reproduce the results reported in SPIN's paper, should I use revision=ac6e600eefcce74f5e8bae1035d4f66019e93190 and train with the SPIN datasets (iter 0, 1, 2, 3) provided on Hugging Face? https://github.com/uclaml/SPIN/blob/main/configs/config.yaml

Try setting --num_train_epochs=6. We've uncommented it in the revised finetune.sh script (previously specified in comments). You can still stop at the first few epochs.

> > Try setting --num_train_epochs=6. We've uncommented it in the revised finetune.sh script (previously specified in comments). You can still stop at the first few epochs.
>
> @angelahzyuan I...

DPO relies on the Bradley-Terry (BT) model or the more general Plackett-Luce models, matching outcomes of pairwise comparisons directly with an implicit reward model. Therefore, the core DPO methodology does...
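For reference, the Bradley-Terry model mentioned above is usually written as follows. The symbols are assumptions introduced here for illustration, not taken from the comment: $x$ is a prompt, $y_1$ and $y_2$ are two responses, $r$ is a reward function, and $\sigma$ is the logistic sigmoid.

$$
p(y_1 \succ y_2 \mid x) = \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)} = \sigma\big(r(x, y_1) - r(x, y_2)\big)
$$

Only the difference of the two rewards enters the probability, which is why pairwise comparison data is enough to fit the model.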

> Hello! Thanks for the open-sourced code release. I have been trying to run the fine-tuning with a phi-2 3B model on a 40GB A100 GPU, while running `accelerate launch...