Wav2Vec2 pipeline feature extractor normalizes input over the batch dimension: is this intended behaviour or a bug?

Open ivan-alles opened this issue 9 months ago • 0 comments

I'm trying to understand the intuition behind normalizing the input with layer norm like this:

waveforms = nn.functional.layer_norm(waveforms, waveforms.shape) link

If the input has shape [B, L], this code normalizes it across batch elements: passing the full shape as normalized_shape means the mean is computed by summing all B * L values, regardless of which batch element they belong to. The same applies to the variance. Is it really the intended behaviour that one batch element can influence another?
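To make the coupling concrete, here is a minimal sketch of what I mean. The two clips (`quiet`, `loud`) and the helper name `normalize_whole_batch` are made up for illustration; only the `layer_norm` call mirrors the line above.

```python
import torch
import torch.nn.functional as F

# Two hypothetical clips with very different amplitudes (illustrative values only).
quiet = torch.tensor([0.01, -0.01, 0.02, -0.02])
loud = torch.tensor([1.0, -1.0, 2.0, -2.0])


def normalize_whole_batch(waveforms):
    # normalized_shape == full shape: a single mean/variance over all B * L samples.
    return F.layer_norm(waveforms, waveforms.shape)


# The same quiet clip, once as a batch of one and once batched with the loud clip.
alone = normalize_whole_batch(quiet.unsqueeze(0))
batched = normalize_whole_batch(torch.stack([quiet, loud]))

# If the normalization were per element, these two rows would match.
quiet_changes_with_batch = not torch.allclose(alone[0], batched[0], atol=1e-3)

print(alone[0])
print(batched[0])
print(quiet_changes_with_batch)
```

In this sketch the quiet clip is scaled by the statistics of the whole batch, so its normalized amplitude shrinks when it shares a batch with the loud clip.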

The original paper states: "The raw waveform input to the encoder is normalized to zero mean and unit variance." It says nothing about normalizing across the batch.

I think the right way is to normalize each batch element independently, so the code should be changed to:

waveforms = nn.functional.layer_norm(waveforms, waveforms.shape[1:])
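A quick check of the proposed change, as a sketch under a few assumptions: the inputs are random stand-ins for equal-length clips of shape [B, L], and the helper name `normalize_per_element` is hypothetical.

```python
import torch
import torch.nn.functional as F


def normalize_per_element(waveforms):
    # normalized_shape == shape[1:]: separate mean/variance for each batch element.
    return F.layer_norm(waveforms, waveforms.shape[1:])


torch.manual_seed(0)
a = torch.randn(1, 16000)                # hypothetical clip (random stand-in)
b = 10.0 * torch.randn(1, 16000) + 3.0   # much louder clip with a DC offset

alone = normalize_per_element(a)
batched = normalize_per_element(torch.cat([a, b]))

# Per-row statistics after normalization (expected close to 0 and 1).
row_mean = batched.mean(dim=1)
row_var = batched.var(dim=1, unbiased=False)

# The result for `a` should not depend on whether `b` is in the batch.
a_is_independent = torch.allclose(alone[0], batched[0], atol=1e-5)

print(row_mean, row_var)
print(a_is_independent)
```

With `waveforms.shape[1:]`, each row is normalized with its own statistics, which is how I read the statement in the paper.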

May 01 '25 16:05 ivan-alles