LayerNormalization layer has an edge case that causes NaN
Hi, FYI for anyone who gets NaN during training with a model that uses LayerNormalization.
The current implementation in CNTK has an edge case that causes NaN:
def layer_normalize(x):
    mean = reduce_mean(x)             # normalize w.r.t. actual sample statistics
    x0 = x - mean
    std = sqrt(reduce_mean(x0 * x0))  # EDGE CASE: the epsilon needs to be inside the sqrt!
    if epsilon != 0:
        std += epsilon
    x_hat = x0 / std
    return x_hat * scale + bias       # denormalize with learned parameters
In the edge case, reduce_mean(x0 * x0) can return a slightly negative value. This is most likely a floating-point rounding issue (for instance, if the variance is computed internally as E[x²] − (E[x])², cancellation can leave a tiny negative result). Unfortunately, sqrt of a negative value immediately produces NaN, which then propagates through training. The solution is to move the epsilon inside the sqrt instead of adding it to std afterwards.
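For concreteness, here is a minimal NumPy sketch of the corrected computation. This is just an illustration of the fix, not the actual cntkx source; layer_normalize_fixed and its scale, bias, and epsilon parameters are placeholder names:

import numpy as np

def layer_normalize_fixed(x, scale=1.0, bias=0.0, epsilon=1e-5):
    mean = np.mean(x)
    x0 = x - mean
    # epsilon goes inside the sqrt, so its argument stays strictly
    # positive even if the variance estimate rounds toward zero or
    # slightly negative -- sqrt can never see a negative input here
    std = np.sqrt(np.mean(x0 * x0) + epsilon)
    x_hat = x0 / std
    return x_hat * scale + bias

A side benefit of this form: std is bounded below by sqrt(epsilon), so the division is also safe for constant inputs whose variance is exactly zero.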
I have already resolved this in my library cntkx. You can install it with pip install cntkx, then from cntkx.layers import LayerNormalization and everything will work fine. cntkx is written in pure Python, so there are no dependency issues either.