Jianfeng Wang
I ran into a similar situation; it seems to be caused by SDPA. Disabling SDPA may solve it. https://github.com/karpathy/nanoGPT/blob/e58f0cfa9466dafe226b51ce6678e2b8fac652d5/model.py#L53
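For context, the nanoGPT line linked above toggles between the fused SDPA kernel and a manual attention implementation. Below is a minimal sketch of that manual (non-fused) path, written in NumPy rather than PyTorch so it stands alone; the name `manual_attention` is mine, not nanoGPT's:

```python
import numpy as np

def manual_attention(q, k, v):
    """Plain softmax attention -- the fallback path when SDPA/flash is disabled."""
    d = q.shape[-1]
    # Scaled dot-product scores of shape (T, T)
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = manual_attention(q, k, v)
print(out.shape)  # (4, 8)
```

The fused SDPA kernel computes the same quantity, just in one kernel without materializing the full score matrix, which is where the numerical behavior can differ.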
Setting AdamW's eps to 1e-5 also works for me, with flash attention still enabled.
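To make the eps tradeoff concrete: Adam's step is `lr * m_hat / (sqrt(v_hat) + eps)`, so a larger eps clamps the step size whenever `sqrt(v_hat)` is small, which suppresses the extreme steps but also slows early progress on small gradients. A toy single-step sketch in plain Python (not the PyTorch optimizer):

```python
import math

def adam_step_size(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of the first Adam update for a single scalar gradient."""
    m = (1 - beta1) * grad            # first moment after one step
    v = (1 - beta2) * grad ** 2       # second moment after one step
    m_hat = m / (1 - beta1)           # bias correction
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

tiny_grad = 1e-6
step_small_eps = adam_step_size(tiny_grad, eps=1e-8)
step_large_eps = adam_step_size(tiny_grad, eps=1e-5)
print(step_small_eps, step_large_eps)  # larger eps -> much smaller step
```

With eps = 1e-8 the tiny gradient still produces a near-full-size step; with eps = 1e-5 the same gradient's step shrinks by roughly an order of magnitude, which matches the slower convergence observed in this thread.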
> I set the eps from 1e-8 to 1e-5, and the convergence is much slower. Have you noticed this?

I do notice the slower convergence at the beginning, but it will...
I see your point.

> because Adam is scale invariant, so multiplying the scale by any constant is a no-op

Though I think it's not entirely scale invariant due to...
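On the scale-invariance point: with eps = 0, scaling every gradient by a constant c cancels between `m_hat` and `sqrt(v_hat)`, so the update is unchanged; with eps > 0 the cancellation is no longer exact, which is presumably the caveat above. A scalar sketch (one bias-corrected Adam step, my own simplification):

```python
import math

def adam_update(grad, lr=1e-3, eps=0.0):
    """First-step Adam update for one scalar, with bias-corrected moments."""
    m_hat = grad                      # m / (1 - beta1) after one step
    v_hat = grad ** 2                 # v / (1 - beta2) after one step
    return lr * m_hat / (math.sqrt(v_hat) + eps)

g, c = 1e-6, 1000.0
# eps = 0: exactly scale invariant -- both updates equal lr
invariant = (adam_update(g, eps=0.0), adam_update(c * g, eps=0.0))
# eps > 0: scaling the gradient now changes the update
not_invariant = (adam_update(g, eps=1e-5), adam_update(c * g, eps=1e-5))
print(invariant, not_invariant)
```

The eps term only matters when `sqrt(v_hat)` is comparable to eps, so the invariance breaks most visibly for very small gradients.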
> @wjfwzzc Excuse me, I'm interested in how your Python prints tracebacks, though it's unrelated to this issue. Could you tell me how to make Python print the stack trace like...
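I don't know exactly which setup @wjfwzzc uses (the question above is cut off), but one standard-library way to print the current Python call stack without raising an exception is the `traceback` module:

```python
import traceback

def helper():
    # Capture the current call stack as one formatted string
    return "".join(traceback.format_stack())

def caller():
    return helper()

stack = caller()
print(stack)  # one "File ..., line ..., in ..." block per frame
```

For crashes in native code (e.g. inside a CUDA extension), `faulthandler.enable()` from the same standard library dumps Python stacks on fatal signals, which may be closer to what was asked.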
Ping @tridao, would you fix this issue or provide a workaround? FSDP + activation checkpointing is a fairly common setup for large transformer training.