andrasiani

Results: 2 issues by andrasiani

Hi, I have a pretrained 1.5B-parameter GPT-XL teacher network in fp16 with requires_grad=False. The student network is a small GPT with 142M parameters. I use PyTorch Lightning...

bug
training

Hi, I am trying to run a GPT-2 model with a block size of 2048, and I cannot use a batch size larger than 16 because the activation memory becomes too large. To reduce activation memory...

stale