Ranger21
Gradient normalization lowers the maximum learning rate that can converge.
I ran into this problem while training ResNet18 on CIFAR-100 for an experiment. I haven't yet looked into it enough to find out what the cause is.
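For readers unfamiliar with the term, here is a minimal sketch of what gradient normalization does, assuming it means dividing a gradient tensor by its own standard deviation. This is an illustration in plain Python, not Ranger21's actual code; the function name `normalize_gradient`, the `eps` value, and the toy gradients are all made up for the example.

```python
import statistics

def normalize_gradient(grad, eps=1e-8):
    """Divide every entry by the sample standard deviation of the gradient.

    Assumption for illustration only: this is one common form of gradient
    normalization, not necessarily what Ranger21 implements.
    """
    if len(grad) <= 2:
        # Too few entries for a meaningful standard deviation; leave as-is.
        return list(grad)
    std = statistics.stdev(grad) + eps
    return [g / std for g in grad]

# Two hypothetical gradients that differ only by a factor of 1000.
small = [0.001, -0.002, 0.003, 0.0005]
large = [g * 1000 for g in small]

norm_small = normalize_gradient(small)
norm_large = normalize_gradient(large)

# After normalization both gradients are (almost) identical and have
# standard deviation close to 1, so the raw gradient scale is discarded.
print(norm_small)
print(norm_large)
```

Under this assumption the raw gradient magnitude no longer reaches the update rule. The sketch only shows the operation itself and does not explain the convergence behavior reported above.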