Kite0011
Kite0011
> > > @szhengac You are correct, LAMB and LARS implementations that are not aware of ZeRO will not work correctly with ZeRO. This is not a fundamental limitation of...
> > > > @szhengac You are correct, LAMB and LARS implementations that are not aware of ZeRO will not work correctly with ZeRO. This is not a fundamental limitation...
Hi @lucidrains! Do you mean I can just apply GAU to a cross-attention model such as T5? I found that GAU works very well on BERT models.