Loss becomes nan with Adam and AdamW as base optimizers
The loss becomes nan after training for roughly 20 steps: the loss value steadily decreases at first and then turns to nan when Adam or AdamW is used as the base optimizer. With plain SGD the same setup trains fine.
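Roughly the kind of loop where this shows up, as a sketch only: the real model, data, and this package's wrapper optimizer are not shown in this report, so they are replaced with placeholders here (the toy `nn.Linear` model, the random data, and the commented-out `WrapperOptimizer` name are all hypothetical; only the Adam-vs-SGD swap and the nan check mirror what is described above).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# Reported behaviour: with Adam/AdamW as the base optimizer the loss turns to
# nan after ~20 steps; with SGD it does not.
base = torch.optim.Adam(model.parameters(), lr=1e-3)
# base = torch.optim.SGD(model.parameters(), lr=1e-3)

# optimizer = WrapperOptimizer(base)  # hypothetical stand-in for this package's wrapper
optimizer = base

loss_fn = nn.MSELoss()
for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if torch.isnan(loss):
        print(f"loss became nan at step {step}")
        break
```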