Question regarding batch norm vs. masked batch norm
The paper mentions that batch normalization can suffer from large fluctuations in the batch statistics. In vanilla BN this happens because the statistics are computed over inputs of varying lengths that are zero-padded to a common length. I was wondering whether this fluctuation still occurs in the masked version of BN, where the padded positions are excluded from the statistics (see the sketch below). Additionally, how much of a performance gain can be expected from switching vanilla BN to masked BN? Thanks.
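For concreteness, here is roughly what I mean by "masked" statistics (a minimal PyTorch sketch; the shapes and the name `masked_batch_stats` are just for illustration, not the repo's actual implementation):

```python
import torch

def masked_batch_stats(x, pad_mask):
    """Batch statistics over non-padded positions only (illustrative sketch).

    x:        (B, T, C) activations
    pad_mask: (B, T) bool, True for real tokens, False for padding
    """
    mask = pad_mask.unsqueeze(-1).to(x.dtype)           # (B, T, 1)
    n = mask.sum()                                      # number of real tokens
    mean = (x * mask).sum(dim=(0, 1)) / n               # (C,) per-channel mean
    var = ((x - mean) ** 2 * mask).sum(dim=(0, 1)) / n  # (C,) per-channel var
    return mean, var

# Vanilla BN would instead use x.mean(dim=(0, 1)) and x.var(dim=(0, 1)),
# which mix in the zero padding, so the statistics shift toward 0 by an
# amount that depends on how much of each batch happens to be padding.
```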