Hao

10 comments of Hao

> > Trained at 8k, it can extrapolate to 32k.
> >
> > When training at 8k, is the base already 1,000,000?

Hope this issue can be reopened.

Same question here. From the code it looks like NTK scaling is used, since the base parameter is 5,000,000, whereas 4k training usually uses 10,000. What puzzles me is that the model without the extended context also has a base of 5,000,000, yet the claim is extrapolation from 4k to 32k. Is that consistent? If the length is only extended 8x, why set the base to 5,000,000 (rather than roughly 80,000)?
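For context, a minimal sketch of the NTK-aware base scaling this comment refers to, assuming the commonly cited rule `base' = base * s**(d / (d - 2))`; the head dimension of 128 and the original base of 10,000 are illustrative assumptions, not values taken from the repo:

```python
# NTK-aware RoPE base scaling (sketch): base' = base * s**(d / (d - 2)).
# head_dim = 128 and base = 10,000 are assumptions for illustration.
def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """Return the RoPE base needed to extend the context window by `scale`."""
    return base * scale ** (head_dim / (head_dim - 2))

# Extending 4k -> 32k is scale = 8; this gives roughly 8e4, not 5e6.
print(ntk_scaled_base(10_000, 8, 128))  # ~82,700
```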

I (and most developers) hope the final prompt would look like this, taking the ChatML template as an example:

```
user
2+2=?
assistant
```

The string in Python is `"user\n2+2=?\nassistant\n"`. If we...
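For concreteness, a minimal sketch of building such a prompt by hand, assuming the standard ChatML `<|im_start|>`/`<|im_end|>` special tokens; exact whitespace may vary between implementations:

```python
# Sketch: hand-building a ChatML prompt with the standard special tokens.
def build_chatml_prompt(user_msg: str) -> str:
    return (
        "<|im_start|>user\n"
        f"{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

# The Python string for the example above:
print(repr(build_chatml_prompt("2+2=?")))
# '<|im_start|>user\n2+2=?<|im_end|>\n<|im_start|>assistant\n'
```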

Hi, I wonder whether the loss was normal after converting and training Mixtral with Megatron on your machine. I applied this PR and the initial loss is quite high, which...

Hi, I fixed a bug in my script and now the initial loss is normal (around 2.3 on the arXiv dataset). Thanks for your contribution! Also, I have an extra question,...

Some tips (might be helpful):
1. Decrease `actor_rollout_ref.rollout.n`
2. Ensure the setting `export VLLM_ATTENTION_BACKEND=XFORMERS`
3. Decrease `actor_rollout_ref.actor.ppo_micro_batch_size`
4. Decrease `actor_rollout_ref.rollout.log_prob_micro_batch_size` and `actor_rollout_ref.ref.log_prob_micro_batch_size`
5. Decrease `data.max_response_length`
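A small sketch of the same memory-saving settings expressed as hydra/OmegaConf-style overrides; the key names come from the tips above, while all concrete values are illustrative assumptions rather than recommended defaults:

```python
import os
from omegaconf import OmegaConf

# Tip 2: select the xformers attention backend for vLLM before launching.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

# Tips 1, 3, 4, 5: smaller rollout count, micro batch sizes, and response length
# (example values only).
overrides = OmegaConf.create({
    "actor_rollout_ref": {
        "rollout": {"n": 4, "log_prob_micro_batch_size": 8},
        "actor": {"ppo_micro_batch_size": 8},
        "ref": {"log_prob_micro_batch_size": 8},
    },
    "data": {"max_response_length": 512},
})
print(OmegaConf.to_yaml(overrides))
```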

@puneeshkhanna Great! I'm looking forward to it. Actually, the proposed solution was just to revert the commit that decoupled the lr, so it indeed could not work when there are decoupled lrs.

Same here. This issue can be pretty serious, and needs to be fixed very soon.

> Does this problem occur only when Tensor Parallelism (TP) > 1 and Data Parallelism (DP) > 1?

Currently, I am using DistributedOptimizer with TP = 1 and DP >...