Wei Xiong
Hi, thanks for your interest! PPO is an on-policy, policy-based deep RL method that can achieve a high reward by formulating the task as an MDP. RAFT is...
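For readers unfamiliar with PPO's core idea, here is a minimal pure-Python sketch of its clipped surrogate objective. This is an illustrative toy, not code from this repo; real implementations operate on tensors and batch over trajectories.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss from PPO (negated, so lower is better).

    logp_new / logp_old: log-probabilities of the taken actions under
    the current and behavior policies (on-policy data); advantages:
    advantage estimates for the same actions.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # importance ratio r_t
        clipped = max(1 - clip_eps, min(ratio, 1 + clip_eps))
        total += min(ratio * adv, clipped * adv)         # pessimistic bound
    return -total / len(advantages)
```

With identical policies the ratio is 1 and the loss reduces to the negative mean advantage; when the ratio drifts outside [1 - eps, 1 + eps], clipping caps the update.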
Thanks for the feedback! We will add these features in an upcoming update. For your current need, you can find the main loop of RAFT at around line 442 of src/lmflow/pipeline/raft_aligner.py...
> @shizhediao Today, I tried running the program in an instance that has higher RAM (but with the same number/size of GPUs).
>
> I got pretty similar results.
>
> ...
Thanks for your interest! Do you start the reward modeling from LLaMA-SFT-7B? In our experiments, if we start reward modeling from the original LLaMA-7B, we indeed get 71.64...
One potential issue we recently noticed is that the evaluation batch size should be set to 1. A batch size > 1 leads to a much lower...
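One plausible reason batching changes evaluation scores is padding: once sequences of different lengths are batched together, any padding that is not perfectly masked can leak into the forward pass. This toy sketch (not the repo's code; `naive_mean_logit` is a stand-in for a reward-model forward pass) shows the effect:

```python
def pad_batch(token_lists, pad_id=0):
    """Right-pad a batch of token-id lists to a common length."""
    width = max(len(t) for t in token_lists)
    return [t + [pad_id] * (width - len(t)) for t in token_lists]

def naive_mean_logit(tokens):
    """Toy 'reward model': the mean token id. If padding is not
    masked out, padded copies of the same sequence score differently,
    which is why evaluating with batch size 1 sidesteps the issue."""
    return sum(tokens) / len(tokens)
```

Evaluating each example alone (batch size 1) means no padding is ever added, so the score of a sequence cannot depend on what else happens to be in its batch.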
Do you use full training or LoRA training? Line 45 of examples/reward_modeling decides the training mode. For full training, you may need a much smaller learning rate...
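As a hypothetical illustration of the point above: full fine-tuning updates all weights and usually needs a much smaller learning rate, while LoRA only trains low-rank adapters and tolerates larger steps. The helper and the default values below are illustrative assumptions, not the repo's actual settings.

```python
def pick_learning_rate(mode, full_lr=1e-5, lora_lr=1e-4):
    """Illustrative rule of thumb: full fine-tuning gets a learning
    rate roughly an order of magnitude smaller than LoRA training.
    The exact values are assumptions for the sketch, not recommendations."""
    if mode not in ("full", "lora"):
        raise ValueError(f"unknown training mode: {mode}")
    return full_lr if mode == "full" else lora_lr
```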
> Also, is LoRA used during the SFT training?

In our experiments, we use full training for SFT and LoRA training for LLaMA-7B. We recently tried out open-llama-3b: full training...
It seems that your evaluation loss is larger than our results. For instance, in the open-llama-3b experiment we can achieve an evaluation loss of ~0.49. We also use a block size...
Thanks for the PR! I am busy with some projects as finals approach... I will get back to you as soon as possible.
Hi, I just updated the evaluation script to support a weighted average over the different subsets. The current result now matches that of the official leaderboard. Could you update your pull request accordingly?
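For context, a weighted average scales each subset's score by its size, so large subsets count proportionally (unlike a plain macro-average over subsets). A minimal sketch of the idea, with hypothetical subset names, not the actual evaluation script:

```python
def weighted_average(subset_scores):
    """Size-weighted average of per-subset accuracies.

    subset_scores: mapping subset name -> (accuracy, num_examples).
    Equivalent to pooling every example into one accuracy figure.
    """
    total = sum(n for _, n in subset_scores.values())
    return sum(acc * n for acc, n in subset_scores.values()) / total
```

For example, a perfect score on a 1-example subset and zero on a 3-example subset yields 0.25 overall, whereas an unweighted macro-average would report 0.5.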