Jianfeng Wang
I ran into a similar situation; it seems to be caused by SDPA. Disabling SDPA may solve it. https://github.com/karpathy/nanoGPT/blob/e58f0cfa9466dafe226b51ce6678e2b8fac652d5/model.py#L53
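For context, the nanoGPT line linked above toggles between the fused SDPA kernel and a manual attention implementation. Below is a minimal sketch of that manual (non-fused) path, written in NumPy rather than PyTorch so it stands alone; the name `manual_attention` is mine, not nanoGPT's:

```python
import numpy as np

def manual_attention(q, k, v):
    """Plain softmax attention -- the fallback path when SDPA/flash is disabled."""
    d = q.shape[-1]
    # Scaled dot-product scores of shape (T, T)
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = manual_attention(q, k, v)
print(out.shape)  # (4, 8)
```

The fused SDPA kernel computes the same quantity, just in one kernel without materializing the full score matrix, which is where the numerical behavior can differ.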
Setting AdamW's eps to 1e-5 also works for me, with flash attention still enabled.
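To make the eps tradeoff concrete: Adam's step is `lr * m_hat / (sqrt(v_hat) + eps)`, so a larger eps clamps the step size whenever `sqrt(v_hat)` is small, which suppresses the extreme steps but also slows early progress on small gradients. A toy single-step sketch in plain Python (not the PyTorch optimizer):

```python
import math

def adam_step_size(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of the first Adam update for a single scalar gradient."""
    m = (1 - beta1) * grad            # first moment after one step
    v = (1 - beta2) * grad ** 2       # second moment after one step
    m_hat = m / (1 - beta1)           # bias correction
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

tiny_grad = 1e-6
step_small_eps = adam_step_size(tiny_grad, eps=1e-8)
step_large_eps = adam_step_size(tiny_grad, eps=1e-5)
print(step_small_eps, step_large_eps)  # larger eps -> much smaller step
```

With eps = 1e-8 the tiny gradient still produces a near-full-size step; with eps = 1e-5 the same gradient's step shrinks by roughly an order of magnitude, which matches the slower convergence observed in this thread.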
> I set the eps from 1e-8 to 1e-5, and the convergence is much slower. Have you noticed this?

I do notice the slower convergence at the beginning, but it will...
I see your point.

> because Adam is scale invariant, so multiplying the scale by any constant is a no-op

Though I think it's not entirely scale invariant due to...
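On the scale-invariance point: with eps = 0, scaling every gradient by a constant c cancels between `m_hat` and `sqrt(v_hat)`, so the update is unchanged; with eps > 0 the cancellation is no longer exact, which is presumably the caveat above. A scalar sketch (one bias-corrected Adam step, my own simplification):

```python
import math

def adam_update(grad, lr=1e-3, eps=0.0):
    """First-step Adam update for one scalar, with bias-corrected moments."""
    m_hat = grad                      # m / (1 - beta1) after one step
    v_hat = grad ** 2                 # v / (1 - beta2) after one step
    return lr * m_hat / (math.sqrt(v_hat) + eps)

g, c = 1e-6, 1000.0
# eps = 0: exactly scale invariant -- both updates equal lr
invariant = (adam_update(g, eps=0.0), adam_update(c * g, eps=0.0))
# eps > 0: scaling the gradient now changes the update
not_invariant = (adam_update(g, eps=1e-5), adam_update(c * g, eps=1e-5))
print(invariant, not_invariant)
```

The eps term only matters when `sqrt(v_hat)` is comparable to eps, so the invariance breaks most visibly for very small gradients.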
> @wjfwzzc Excuse me, I'm interested in how your Python prints tracebacks, though it's unrelated to this issue. Could you tell me how to make Python print the stack trace like...
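I don't know exactly which setup @wjfwzzc uses (the question above is cut off), but one standard-library way to print the current Python call stack without raising an exception is the `traceback` module:

```python
import traceback

def helper():
    # Capture the current call stack as one formatted string
    return "".join(traceback.format_stack())

def caller():
    return helper()

stack = caller()
print(stack)  # one "File ..., line ..., in ..." block per frame
```

For crashes in native code (e.g. inside a CUDA extension), `faulthandler.enable()` from the same standard library dumps Python stacks on fatal signals, which may be closer to what was asked.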
Ping @tridao, would you fix this issue or provide a workaround? FSDP + activation checkpointing is a fairly common setup for large transformer training.