Results: 2 comments by Tony
Hi, thank you very much for the reply! I found it very useful. However, it has raised another question. Regarding the masking method, I have another question...
There is also the pre-norm versus post-norm question. I believe the author went with post-norm because it stabilizes training. GPT-2 experienced unstable training due to pre-norm, and since then...
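For anyone following along, the two orderings differ only in where the normalization sits relative to the residual add. Here is a minimal sketch in plain Python, assuming a layer norm without learned scale or shift; `toy_sublayer` is a hypothetical stand-in for attention or an MLP, not anything from the author's code.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (roughly) unit variance.

    Assumption: no learned scale/shift parameters, to keep the sketch short.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-norm: normalize after the residual add, y = LN(x + f(x))."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-norm: normalize only the sublayer input, y = x + f(LN(x))."""
    fx = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical sublayer: scales every element by 0.5."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_norm_block(x, toy_sublayer)
pre_out = pre_norm_block(x, toy_sublayer)
```

In the post-norm block the output is always normalized, so `post_out` has zero mean. In the pre-norm block the residual path carries `x` through unchanged, so `pre_out` keeps the mean of the input.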