
LayerNormalization layer has an edge case that causes NaN

delzac opened this issue on Apr 19, 2020 · 0 comments

Hi, FYI for anyone who gets NaN during training with a model that uses LayerNormalization.

The current implementation in CNTK has an edge case that causes NaN:

def layer_normalize(x):
    mean = reduce_mean(x)               # normalize w.r.t. actual sample statistics
    x0 = x - mean
    std = sqrt(reduce_mean(x0 * x0))    # EDGE CASE: you need the epsilon inside the sqrt!
    if epsilon != 0:
        std += epsilon
    x_hat = x0 / std
    return x_hat * scale + bias         # denormalize with learned parameters

In the edge case, reduce_mean(x0 * x0) can return a slightly negative value, even though a mean of squares is mathematically non-negative, most likely due to floating-point rounding error in the reduction. sqrt of a negative value immediately produces NaN, and adding epsilon afterwards cannot undo that. So the solution is to shift the epsilon into the sqrt instead.
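For reference, here is a sketch of what the corrected computation looks like, written in the same pseudocode as the snippet above:

def layer_normalize(x):
    mean = reduce_mean(x)                         # normalize w.r.t. actual sample statistics
    x0 = x - mean
    std = sqrt(reduce_mean(x0 * x0) + epsilon)    # epsilon inside the sqrt keeps the argument positive
    x_hat = x0 / std
    return x_hat * scale + bias                   # denormalize with learned parameters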

I have already resolved this in my library cntkx. You can install it with pip install cntkx and from cntkx.layers import LayerNormalization, and everything will work fine. cntkx is written in pure Python, so there are no dependency issues either.
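For completeness, a minimal usage sketch, assuming cntkx follows the standard CNTK Layers convention where the layer factory returns a function that is applied to a variable:

import cntk as C
from cntkx.layers import LayerNormalization

x = C.input_variable(10)        # hypothetical 10-dimensional input
y = LayerNormalization()(x)     # NaN-safe: epsilon is applied inside the sqrt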
