Tony

2 comments by Tony

Hi, thank you very much for the reply! I found it very useful. However, it has raised another question. Regarding the masking method, I have another doubt...

There is the question of pre-norm versus post-norm. I believe the author went with post-norm because it stabilizes training. GPT-2 experienced unstable training due to pre-norm, and since then...