daejin
The code uploaded in this repo does not seem to support some parameter types, such as shortcut connections, batch-norm parameters, and bias terms. Does LAP only operate on the weights, excluding the terms above?
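Independent of the LAP repo's own code, here is a minimal PyTorch sketch of the behaviour the question describes: a magnitude-based mask is built only for Conv/Linear weight tensors, while batch-norm parameters, biases, and shortcut branches are left dense. The thresholding criterion is just a placeholder, not the method from the paper.

```python
import torch
import torch.nn as nn

def prune_weights_only(model: nn.Module, threshold: float = 1e-2) -> dict:
    """Illustrative sketch: mask only Conv/Linear weights, skip BN/bias/shortcuts."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            mask = (module.weight.abs() > threshold).float()
            module.weight.data.mul_(mask)      # bias (if any) is left untouched
            masks[f"{name}.weight"] = mask
        # BatchNorm layers and residual/shortcut parameters are simply skipped
    return masks

masks = prune_weights_only(nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8)))
print(list(masks.keys()))  # only the Linear weight receives a mask
```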
I'm encountering an issue where gradients become NaN while training the Gemma2 model with transformers and flash-attn. I used soft-capping during training.

Environment:
- transformers @ git+https://github.com/huggingface/transformers.git@ac946aac257cadfa8264fa4a284cd0ea1061c5b5
- flash-attn==2.6.1
- torch==2.3.1
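As a diagnostic sketch (not a confirmed fix), one can load Gemma2 with the `eager` attention implementation so the `attn_logit_softcapping` path in the modeling code is exercised, then run a single backward pass and look for NaN gradients. The checkpoint name `google/gemma-2-9b` is an assumption for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b"  # assumed checkpoint, for illustration only
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,     # bf16 is less overflow-prone than fp16
    attn_implementation="eager",    # eager path applies attn_logit_softcapping
)
print(model.config.attn_logit_softcapping, model.config.final_logit_softcapping)

batch = tok("Soft-capping NaN check", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
nan_params = [n for n, p in model.named_parameters()
              if p.grad is not None and torch.isnan(p.grad).any()]
print("params with NaN grads:", nan_params)
```

Comparing this eager run against the flash-attn run may help narrow down whether the NaNs are tied to how soft-capping interacts with the flash-attn kernel in this version combination.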
In a recent commit, I noticed an inconsistency in the configuration of the `query_pre_attn_scalar` parameter between the 9B and 27B models in this repository. Specifically: in the 9B model, ...
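For anyone wanting to check the settings directly, here is a small sketch that reads `query_pre_attn_scalar` from the published configs and compares it with `hidden_size / num_attention_heads`. The checkpoint names are assumptions; the printed values come from whatever is currently on the Hub, not from this post.

```python
from transformers import AutoConfig

for model_id in ("google/gemma-2-9b", "google/gemma-2-27b"):  # assumed checkpoints
    cfg = AutoConfig.from_pretrained(model_id)
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    print(model_id,
          "query_pre_attn_scalar =", cfg.query_pre_attn_scalar,
          "| hidden_size / num_heads =", head_dim)
```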