Wei Xiong
Hi, thanks for your interest! PPO is an on-policy, policy-based deep RL method that can achieve a high reward by formulating the task as an MDP. RAFT is...
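For readers unfamiliar with PPO's core idea, here is a minimal pure-Python sketch of its clipped surrogate objective. This is an illustrative toy, not code from this repo; real implementations operate on tensors and batch over trajectories.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss from PPO (negated, so lower is better).

    logp_new / logp_old: log-probabilities of the taken actions under
    the current and behavior policies (on-policy data); advantages:
    advantage estimates for the same actions.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # importance ratio r_t
        clipped = max(1 - clip_eps, min(ratio, 1 + clip_eps))
        total += min(ratio * adv, clipped * adv)         # pessimistic bound
    return -total / len(advantages)
```

With identical policies the ratio is 1 and the loss reduces to the negative mean advantage; when the ratio drifts outside [1 - eps, 1 + eps], clipping caps the update.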
Thanks for the feedback! We will add these features in an upcoming update. For your current need, you can find the main loop of RAFT at around line 442 of src/lmflow/pipeline/raft_aligner.py...
> @shizhediao Today, I tried running the program in an instance that has higher RAM (but with the same number/size of GPUs).
>
> I got pretty similar results.
>
> ...
Thanks for your interest! Do you start the reward modeling from LLaMA-SFT-7B? In our experiments, if we start reward modeling from the original LLaMA-7B, we indeed get 71.64...
One potential issue we recently noticed is that the evaluation batch size should be set to 1. A batch size > 1 leads to a much lower...
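One plausible reason batching changes evaluation scores is padding: once sequences of different lengths are batched together, any padding that is not perfectly masked can leak into the forward pass. This toy sketch (not the repo's code; `naive_mean_logit` is a stand-in for a reward-model forward pass) shows the effect:

```python
def pad_batch(token_lists, pad_id=0):
    """Right-pad a batch of token-id lists to a common length."""
    width = max(len(t) for t in token_lists)
    return [t + [pad_id] * (width - len(t)) for t in token_lists]

def naive_mean_logit(tokens):
    """Toy 'reward model': the mean token id. If padding is not
    masked out, padded copies of the same sequence score differently,
    which is why evaluating with batch size 1 sidesteps the issue."""
    return sum(tokens) / len(tokens)
```

Evaluating each example alone (batch size 1) means no padding is ever added, so the score of a sequence cannot depend on what else happens to be in its batch.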
Do you use full training or LoRA training? Line 45 of examples/reward_modeling decides the training mode. For full training, you may need a much smaller learning rate...
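As a hypothetical illustration of the point above: full fine-tuning updates all weights and usually needs a much smaller learning rate, while LoRA only trains low-rank adapters and tolerates larger steps. The helper and the default values below are illustrative assumptions, not the repo's actual settings.

```python
def pick_learning_rate(mode, full_lr=1e-5, lora_lr=1e-4):
    """Illustrative rule of thumb: full fine-tuning gets a learning
    rate roughly an order of magnitude smaller than LoRA training.
    The exact values are assumptions for the sketch, not recommendations."""
    if mode not in ("full", "lora"):
        raise ValueError(f"unknown training mode: {mode}")
    return full_lr if mode == "full" else lora_lr
```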
> Also, is LoRA used during the SFT training?

In our experiments, we use full training for SFT and LoRA training for LLaMA-7B. We recently tried out open-llama-3b: full training...
It seems that your evaluation loss is larger than our results. For instance, in the open-llama-3b experiment we can achieve an evaluation loss of ~0.49. We also use a block size...
Thanks for the PR! I am busy with some projects as finals approach... I will get back to you as soon as possible.
Hi, I just updated the evaluation script to support a weighted average over the different subsets. The current result now matches that of the official leaderboard. Could you update your pull request accordingly?
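For context, a weighted average scales each subset's score by its size, so large subsets count proportionally (unlike a plain macro-average over subsets). A minimal sketch of the idea, with hypothetical subset names, not the actual evaluation script:

```python
def weighted_average(subset_scores):
    """Size-weighted average of per-subset accuracies.

    subset_scores: mapping subset name -> (accuracy, num_examples).
    Equivalent to pooling every example into one accuracy figure.
    """
    total = sum(n for _, n in subset_scores.values())
    return sum(acc * n for acc, n in subset_scores.values()) / total
```

For example, a perfect score on a 1-example subset and zero on a 3-example subset yields 0.25 overall, whereas an unweighted macro-average would report 0.5.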