Questions about ASR frontend, specifically Librispeech, feature normalization
Hi,
I'm checking your ASR frontend, specifically the Librispeech audio feature extraction, and have some questions.
References: Librispeech params, ASR encoder, ASR model, create ASR features, ExtractLogMelFeatures, MelAsrFrontend
I don't see feature normalization being applied anywhere. Is that right? So the initial convolutional network receives unnormalized features?
There are the options per_bin_mean/per_bin_stddev, but these are not used?
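For clarity, this is how I would expect per_bin_mean/per_bin_stddev to be applied if they were used (a minimal NumPy sketch; only the option names come from your code, the exact application is my assumption):

```python
import numpy as np

def normalize_per_bin(log_mel, per_bin_mean, per_bin_stddev, eps=1e-8):
    """Standardize each mel bin with precomputed statistics.

    log_mel:        [num_frames, num_bins] log-mel features
    per_bin_mean:   [num_bins] per-bin mean (assumed semantics)
    per_bin_stddev: [num_bins] per-bin standard deviation (assumed semantics)
    """
    return (log_mel - np.asarray(per_bin_mean)) / (np.asarray(per_bin_stddev) + eps)
```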
I also wonder about the 32768 factor. As input to MelAsrFrontend, do you expect the values to be in the range [-32768..32768]? But shouldn't you divide by 32768 at some point in the pipeline? I see there is the option use_divide_stream which would do that, but it is also not used. Is that right? There is also the src_pcm_scale option in AsrFrontendConfig, but I don't see it being used anywhere either.
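I.e., I would have expected something like this early in the pipeline (sketch, assuming int16 PCM input; the function name here is mine, not from your code):

```python
import numpy as np

def scale_pcm_to_unit_range(pcm_int16):
    """Map int16 PCM samples from [-32768, 32767] to floats in [-1, 1)."""
    return pcm_int16.astype(np.float32) / 32768.0
```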
What numerical range do you expect for the unnormalized features? Using your code, I end up with the following statistics for the 80-dim log-mel features (computed roughly as in the sketch after the numbers):
Mean:
[6.571567 6.8085003 6.874515 6.9014635 7.014142 7.0960007 7.154998
7.2196846 7.2804766 7.3604364 7.5162773 7.6646695 7.7035103 7.637239
7.473917 7.3341737 7.334874 7.225426 7.2850995 7.330272 7.3985167
7.37795 7.296777 7.3162036 7.3547645 7.3710914 7.365237 7.408713
7.4906807 7.561958 7.7213554 7.8515496 7.96462 8.009668 8.006353
8.028718 8.093952 8.1421175 8.137211 8.187237 8.285318 8.338231
8.355931 8.366059 8.428564 8.457295 8.45581 8.557338 8.637221
8.672467 8.788814 8.787889 8.667367 8.60016 8.601487 8.72531
8.768343 8.765634 8.858865 8.845884 8.873101 8.831113 8.854687
8.934903 8.90494 8.948756 8.950672 8.937693 8.968388 9.017417
8.98474 8.953554 9.017656 9.010184 9.012505 9.011251 8.993835
8.985477 8.975625 8.9696865]
Std dev:
[2.0369635 2.096931 2.1227295 2.1050122 2.0581489 2.0478609 2.0781274
2.111383 2.1236782 2.1237605 2.153045 2.1902032 2.1789725 2.1117716
2.0372825 2.0394926 2.0196817 1.947741 1.9082164 1.9618858 1.960082
1.8687389 1.791357 1.7570691 1.7195601 1.7076842 1.7179631 1.7446057
1.7433041 1.7451456 1.7246099 1.6775978 1.7071667 1.7530293 1.7220892
1.6955647 1.668047 1.701594 1.7442849 1.7660844 1.7131468 1.6875786
1.7154576 1.651276 1.6555542 1.7089554 1.8156201 1.8667307 1.7612113
1.7350646 1.7320263 1.7157421 1.7632391 1.8203604 1.848493 1.8341905
1.7994169 1.7613424 1.8461764 1.8262035 1.8009105 1.8069444 1.7533652
1.7506559 1.7582294 1.8100908 1.8118345 1.8222826 1.8585421 1.8347679
1.833038 1.8561351 1.838355 1.8722489 1.9128579 1.9025422 1.8568758
1.864883 1.85355 1.8346307]
Min:
[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 1.081316 1.196871
1.2115035 0.81268585 0.6653169 0.23646453 1.0013729 0.78377634
1.0490758 0.37899855 0.07468933 0.02383682 0. 0.
1.5028754 2.9381104 3.1084766 2.3929093 1.4824951 0.50826544
1.0819429 2.375446 2.7917457 2.9367018 2.50226 2.8797576
2.7869585 3.0858362 2.721441 2.890882 1.1783836 0.7760011
3.267177 3.534093 2.9064052 3.580449 2.0504556 0.9798477
0.47552437 0.4765737 3.0911071 2.3247538 2.8063457 2.8832946
2.8879871 3.6785111 3.0960853 3.9359105 2.2698429 2.3523924
4.1431756 1.4312835 2.4378788 4.199857 4.1510873 3.359309
3.776203 3.579252 3.8776345 3.7463636 3.3691368 0.47623247
3.9832573 3.867248 ]
Max:
[10.206617 10.944173 11.139801 11.130556 11.729167 11.843884
11.478071 11.72813 11.813193 12.02717 12.456151 12.265145
12.660903 12.499597 12.844959 13.138891 13.047909 12.737233
13.1444235 12.870635 12.814672 12.34872 12.536846 12.46696
11.983264 12.251521 12.125826 12.2784195 12.577657 12.5084095
12.271909 12.40977 13.049571 13.354869 13.52251 13.364965
13.063794 13.076281 13.131835 12.594194 12.473833 12.474021
12.613151 12.120035 12.36114 12.519782 13.078986 13.060618
13.451843 13.620312 13.377903 13.563233 13.519144 13.6905985
13.992317 13.707613 13.95073 14.305102 14.90581 15.447515
15.098798 14.792583 14.573465 14.620636 14.31828 14.933836
14.767882 14.567602 14.487499 14.37275 14.498172 14.244799
14.251987 14.410744 14.358433 14.431413 14.4958935 14.791843
14.440226 14.67428 ]
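This is roughly how I computed the statistics above (sketch; `feature_batches` stands for the per-utterance outputs of the frontend):

```python
import numpy as np

def feature_stats(feature_batches):
    """Per-bin statistics over a list of [num_frames, num_bins] log-mel
    arrays, e.g. one array per utterance from the frontend."""
    feats = np.concatenate(feature_batches, axis=0)  # [total_frames, num_bins]
    return (feats.mean(axis=0), feats.std(axis=0),
            feats.min(axis=0), feats.max(axis=0))
```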
Do those statistics look right? And are these the values the conv net gets? I would have expected that they need to be normalized in some way.
Of course, we can do BatchNorm or LayerNorm ourselves.
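E.g., a minimal sketch of what we would add ourselves on top of the frontend output (per-frame layer norm over the feature dimension; this is our own workaround, not something I found in your code):

```python
import numpy as np

def layer_norm_features(log_mel, eps=1e-6):
    """Per-frame layer normalization over the mel bins of
    [num_frames, num_bins] log-mel features."""
    mean = log_mel.mean(axis=-1, keepdims=True)
    std = log_mel.std(axis=-1, keepdims=True)
    return (log_mel - mean) / (std + eps)
```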