
Questions about ASR frontend, specifically Librispeech, feature normalization

Open · albertz opened this issue 4 years ago · 1 comment

Hi,

I'm checking your ASR frontend, specifically the Librispeech audio feature extraction, and have some questions.

References: Librispeech params, ASR encoder, ASR model, create ASR features, ExtractLogMelFeatures, MelAsrFrontend

I don't see feature normalization being used anywhere. Is that right? So the initial convolutional network gets unnormalized features as input?

There is an option per_bin_mean/per_bin_stddev, but this is not used?

I wonder a bit about the 32768 factor. As input to MelAsrFrontend, you expect the values to be in the range [-32768..32768]? But shouldn't you divide by 32768 at some point in the pipeline? I see that there is the option use_divide_stream, which would do that, but this is also not used. Is that right? There is also the src_pcm_scale option in AsrFrontendConfig, but I don't see this being used anywhere either?
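As a side note on what the 32768 factor does to log features: scaling the waveform by a constant scales power-domain quantities by the square of that constant, so after the log it becomes a constant additive offset. The sketch below illustrates this with a plain summed-squares energy standing in for one power-domain bin. It is not lingvo code: the frame values are made up, and whether the frontend takes the log of a magnitude or a power spectrum (and which log floor it applies) is an assumption not settled here. For a magnitude spectrum the offset would be ln(32768) instead of 2 ln(32768).

```python
import math

PCM_SCALE = 32768.0  # int16 full scale


def log_energy(samples):
    """Natural log of the summed squared samples (stand-in for one power-domain bin)."""
    return math.log(sum(s * s for s in samples))


# Hypothetical frame with values in the int16 range (illustrative only).
frame_int16 = [1200.0, -850.0, 430.0, -2100.0, 975.0]
# The same frame rescaled to [-1, 1], as a divide-by-32768 step would do.
frame_unit = [s / PCM_SCALE for s in frame_int16]

# Difference between the two log energies: a constant, independent of the frame.
offset = log_energy(frame_int16) - log_energy(frame_unit)
expected = 2.0 * math.log(PCM_SCALE)

print(offset, expected)
```

Under these assumptions, skipping the division changes the per-bin mean but not the per-bin standard deviation, as long as the log floor is not hit.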

What numerical values do you expect for the unnormalized features? By using your code, I end up with these statistics for the 80-dim log-mel features:

  Mean:
[6.571567  6.8085003 6.874515  6.9014635 7.014142  7.0960007 7.154998
 7.2196846 7.2804766 7.3604364 7.5162773 7.6646695 7.7035103 7.637239
 7.473917  7.3341737 7.334874  7.225426  7.2850995 7.330272  7.3985167
 7.37795   7.296777  7.3162036 7.3547645 7.3710914 7.365237  7.408713
 7.4906807 7.561958  7.7213554 7.8515496 7.96462   8.009668  8.006353
 8.028718  8.093952  8.1421175 8.137211  8.187237  8.285318  8.338231
 8.355931  8.366059  8.428564  8.457295  8.45581   8.557338  8.637221
 8.672467  8.788814  8.787889  8.667367  8.60016   8.601487  8.72531
 8.768343  8.765634  8.858865  8.845884  8.873101  8.831113  8.854687
 8.934903  8.90494   8.948756  8.950672  8.937693  8.968388  9.017417
 8.98474   8.953554  9.017656  9.010184  9.012505  9.011251  8.993835
 8.985477  8.975625  8.9696865]
  Std dev:
[2.0369635 2.096931  2.1227295 2.1050122 2.0581489 2.0478609 2.0781274
 2.111383  2.1236782 2.1237605 2.153045  2.1902032 2.1789725 2.1117716
 2.0372825 2.0394926 2.0196817 1.947741  1.9082164 1.9618858 1.960082
 1.8687389 1.791357  1.7570691 1.7195601 1.7076842 1.7179631 1.7446057
 1.7433041 1.7451456 1.7246099 1.6775978 1.7071667 1.7530293 1.7220892
 1.6955647 1.668047  1.701594  1.7442849 1.7660844 1.7131468 1.6875786
 1.7154576 1.651276  1.6555542 1.7089554 1.8156201 1.8667307 1.7612113
 1.7350646 1.7320263 1.7157421 1.7632391 1.8203604 1.848493  1.8341905
 1.7994169 1.7613424 1.8461764 1.8262035 1.8009105 1.8069444 1.7533652
 1.7506559 1.7582294 1.8100908 1.8118345 1.8222826 1.8585421 1.8347679
 1.833038  1.8561351 1.838355  1.8722489 1.9128579 1.9025422 1.8568758
 1.864883  1.85355   1.8346307]
  Min/max:
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         1.081316   1.196871
 1.2115035  0.81268585 0.6653169  0.23646453 1.0013729  0.78377634
 1.0490758  0.37899855 0.07468933 0.02383682 0.         0.
 1.5028754  2.9381104  3.1084766  2.3929093  1.4824951  0.50826544
 1.0819429  2.375446   2.7917457  2.9367018  2.50226    2.8797576
 2.7869585  3.0858362  2.721441   2.890882   1.1783836  0.7760011
 3.267177   3.534093   2.9064052  3.580449   2.0504556  0.9798477
 0.47552437 0.4765737  3.0911071  2.3247538  2.8063457  2.8832946
 2.8879871  3.6785111  3.0960853  3.9359105  2.2698429  2.3523924
 4.1431756  1.4312835  2.4378788  4.199857   4.1510873  3.359309
 3.776203   3.579252   3.8776345  3.7463636  3.3691368  0.47623247
 3.9832573  3.867248  ] /
[10.206617  10.944173  11.139801  11.130556  11.729167  11.843884
 11.478071  11.72813   11.813193  12.02717   12.456151  12.265145
 12.660903  12.499597  12.844959  13.138891  13.047909  12.737233
 13.1444235 12.870635  12.814672  12.34872   12.536846  12.46696
 11.983264  12.251521  12.125826  12.2784195 12.577657  12.5084095
 12.271909  12.40977   13.049571  13.354869  13.52251   13.364965
 13.063794  13.076281  13.131835  12.594194  12.473833  12.474021
 12.613151  12.120035  12.36114   12.519782  13.078986  13.060618
 13.451843  13.620312  13.377903  13.563233  13.519144  13.6905985
 13.992317  13.707613  13.95073   14.305102  14.90581   15.447515
 15.098798  14.792583  14.573465  14.620636  14.31828   14.933836
 14.767882  14.567602  14.487499  14.37275   14.498172  14.244799
 14.251987  14.410744  14.358433  14.431413  14.4958935 14.791843
 14.440226  14.67428  ]

Does that look right? And these are the values the conv net gets? I would have expected that you need to normalize them in some way.
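For reference, the kind of normalization meant here can be sketched as per-bin mean/variance normalization: estimate one mean and one standard deviation per mel bin over the training frames, then standardize each frame. This is a minimal, dependency-free illustration, not how lingvo's per_bin_mean/per_bin_stddev options are implemented. The function names `per_bin_stats` and `normalize` and the toy 3-bin data are hypothetical.

```python
import math


def per_bin_stats(frames):
    """Return (means, stddevs), one value per feature bin, computed over all frames."""
    num_frames = len(frames)
    columns = list(zip(*frames))  # one tuple of values per bin
    means = [sum(col) / num_frames for col in columns]
    stddevs = [
        math.sqrt(sum((v - m) ** 2 for v in col) / num_frames)
        for col, m in zip(columns, means)
    ]
    return means, stddevs


def normalize(frames, means, stddevs, eps=1e-8):
    """Standardize every frame bin-wise; eps guards against a zero stddev."""
    return [
        [(v - m) / (s + eps) for v, m, s in zip(frame, means, stddevs)]
        for frame in frames
    ]


# Toy data: 4 frames x 3 bins (hypothetical values, not real log-mel features).
frames = [
    [6.5, 7.2, 8.9],
    [7.1, 7.9, 9.4],
    [5.8, 6.6, 8.1],
    [6.9, 7.5, 9.0],
]

means, stddevs = per_bin_stats(frames)
normalized = normalize(frames, means, stddevs)
norm_means, norm_stddevs = per_bin_stats(normalized)
print(norm_means, norm_stddevs)
```

In practice the statistics would be computed once over the training set and then applied as fixed constants, which is the role the Mean and Std dev vectors above could play.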

albertz, Mar 03 '21 10:03

We can do BN or LN by ourselves.

Mddct, Mar 24 '21 01:03