LayerNormalization layer has an edge case that causes NaN
Hi, FYI for anyone who gets NaN during training with a model that uses LayerNormalization.
The current implementation in CNTK has an edge case that causes NaN:
def layer_normalize(x):
    mean = reduce_mean(x)             # normalize w.r.t. actual sample statistics
    x0 = x - mean
    std = sqrt(reduce_mean(x0 * x0))  # EDGE CASE: the epsilon needs to be inside the sqrt!
    if epsilon != 0:
        std += epsilon
    x_hat = x0 / std
    return x_hat * scale + bias       # denormalize with learned parameters
In the edge case, reduce_mean(x0 * x0) can return a slightly negative value. This is most likely a floating-point rounding issue (for instance, if the variance is computed internally as E[x²] − (E[x])², cancellation can leave a tiny negative result). Unfortunately, sqrt of a negative value immediately produces NaN, which then propagates through training. The solution is to move the epsilon inside the sqrt instead of adding it to std afterwards.
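For concreteness, here is a minimal NumPy sketch of the corrected computation. This is just an illustration of the fix, not the actual cntkx source; layer_normalize_fixed and its scale, bias, and epsilon parameters are placeholder names:

import numpy as np

def layer_normalize_fixed(x, scale=1.0, bias=0.0, epsilon=1e-5):
    mean = np.mean(x)
    x0 = x - mean
    # epsilon goes inside the sqrt, so its argument stays strictly
    # positive even if the variance estimate rounds toward zero or
    # slightly negative -- sqrt can never see a negative input here
    std = np.sqrt(np.mean(x0 * x0) + epsilon)
    x_hat = x0 / std
    return x_hat * scale + bias

A side benefit of this form: std is bounded below by sqrt(epsilon), so the division is also safe for constant inputs whose variance is exactly zero.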
I have already resolved this in my library cntkx. You can install it with pip install cntkx, then from cntkx.layers import LayerNormalization and everything will work fine. cntkx is written in pure Python, so there are no dependency issues either.