andrasiani
Results: 2 issues of andrasiani
Hi, I have a 1.5B-parameter GPT-XL pretrained teacher network in fp16 with requires_grad=False. The student network is a small GPT with 142M parameters. I use PyTorch Lightning...
bug
training
Hi, I am trying to run a GPT-2 model with a block size of 2048, and I cannot use a batch size larger than 16 because the activation memory becomes too large. To reduce activation memory...
stale