Result For SpeechTokenizer
Here are the results for SpeechTokenizer at a bit rate of 2 kbps.
Results in exps/results.txt
Codec SUPERB application evaluation
Stage 1: Run speech emotion recognition. Acc: 72.15%
Stage 2: Run speaker related evaluation. EER: 4.03%
Stage 3: Run automatic speech recognition. WER: 4.55%
Stage 4: Run audio event classification. ACC: 25.50%
Results in src/codec_metrics/exps/results.txt
Log results
File Name: crema_d.log
Codec SUPERB objective metric evaluation on crema_d
Stage 1: Run SDR evaluation. SDR: mean score is: -29.90983049070145
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 5.345735
Stage 3: Run STOI. stoi: mean score is: 0.06024890838574476
Stage 4: Run PESQ. pesq: mean score is: 1.586073912382126
File Name: esc50.log
Codec SUPERB objective metric evaluation on esc50
Stage 1: Run SDR evaluation. SDR: mean score is: -22.282276880645814
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 3.4074209
File Name: fluent_speech_commands.log
Codec SUPERB objective metric evaluation on fluent_speech_commands
Stage 1: Run SDR evaluation. SDR: mean score is: 1.5112133717223253
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 0.8877456
Stage 3: Run STOI. stoi: mean score is: 0.8648300690857609
Stage 4: Run PESQ. pesq: mean score is: 2.170962030887604
File Name: fsd50k.log
Codec SUPERB objective metric evaluation on fsd50k
Stage 1: Run SDR evaluation. SDR: mean score is: -21.45771079855064
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 3.1137948
File Name: gunshot_triangulation.log
Codec SUPERB objective metric evaluation on gunshot_triangulation
Stage 1: Run SDR evaluation. SDR: mean score is: -22.950851389668035
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 4.621136
File Name: libri2Mix_test.log
Codec SUPERB objective metric evaluation on libri2Mix_test
Stage 1: Run SDR evaluation. SDR: mean score is: -3.846337947640395
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 0.9027287
Stage 3: Run STOI. stoi: mean score is: 0.8309377170272262
Stage 4: Run PESQ. pesq: mean score is: 1.5058157062530517
File Name: librispeech.log
Codec SUPERB objective metric evaluation on librispeech
Stage 1: Run SDR evaluation. SDR: mean score is: 1.0211239468849096
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 0.8223095
Stage 3: Run STOI. stoi: mean score is: 0.8872668136911973
Stage 4: Run PESQ. pesq: mean score is: 2.2581932806968688
File Name: quesst.log
Codec SUPERB objective metric evaluation on quesst
Stage 1: Run SDR evaluation. SDR: mean score is: -1.774289102870904
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 1.153448
Stage 3: Run STOI. stoi: mean score is: 0.7758606059083771
Stage 4: Run PESQ. pesq: mean score is: 1.8245106658550223
File Name: snips_test_valid_subset.log
Codec SUPERB objective metric evaluation on snips_test_valid_subset
Stage 1: Run SDR evaluation. SDR: mean score is: 3.7615257663215895
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 0.8986037
Stage 3: Run STOI. stoi: mean score is: 0.9141771654461831
Stage 4: Run PESQ. pesq: mean score is: 2.2321277034282683
File Name: vox_lingua_top10.log
Codec SUPERB objective metric evaluation on vox_lingua_top10
Stage 1: Run SDR evaluation. SDR: mean score is: -27.182861328199774
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 5.430982
Stage 3: Run STOI. stoi: mean score is: 0.14532493265232807
Stage 4: Run PESQ. pesq: mean score is: 1.6926373445987701
File Name: voxceleb1.log
Codec SUPERB objective metric evaluation on voxceleb1
Stage 1: Run SDR evaluation. SDR: mean score is: -1.9323934995843512
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 0.823112
Stage 3: Run STOI. stoi: mean score is: 0.8241731080501418
Stage 4: Run PESQ. pesq: mean score is: 1.9483790636062621
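If you want to work with these numbers programmatically, the stage lines above follow a regular "Stage N: Run ... <metric>: mean score is: <value>" pattern. The following is a minimal Python sketch for pulling the metric values out of one log; the function name `parse_metrics` and the regular expression are illustrative assumptions based only on the lines shown here, not part of the Codec-SUPERB tooling.

```python
import re

# Hypothetical pattern inferred from the log lines shown above:
# "Stage <n>: Run <description>. <metric>: mean score is: <value>"
LINE_RE = re.compile(
    r"Stage \d+: Run .+?\. (\w+): mean score is: (-?\d+(?:\.\d+)?)"
)


def parse_metrics(log_text):
    """Return {metric_name: mean_score} for every stage line in log_text."""
    return {name: float(value) for name, value in LINE_RE.findall(log_text)}


# Stage lines copied from the librispeech log above.
sample = """Stage 1: Run SDR evaluation. SDR: mean score is: 1.0211239468849096
Stage 2: Run Mel Spectrogram Loss. mel_loss: mean score is: 0.8223095
Stage 3: Run STOI. stoi: mean score is: 0.8872668136911973
Stage 4: Run PESQ. pesq: mean score is: 2.2581932806968688"""

metrics = parse_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Logs for the audio datasets only contain the SDR and Mel loss stages, so the returned dictionary simply has fewer keys for them.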
Average SDR for speech datasets: -4.06314554192561
Average Mel_Loss for speech datasets: 1.5598471285714286
Average STOI for speech datasets: 0.7489386302658877
Average PESQ for speech datasets: 1.9475179707608352
Average SDR for audio datasets: -22.23027968958767
Average Mel_Loss for audio datasets: 3.714117233333333
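For anyone cross-checking the averages: the reported speech SDR average is reproduced by taking the mean over the seven speech logs other than crema_d (including crema_d would give roughly -7.29 instead), and the audio average by the mean over esc50, fsd50k, and gunshot_triangulation. The exclusion of crema_d is inferred from the numbers themselves and is not stated in the logs. A minimal Python sketch of that check, with the dictionary names chosen here for illustration:

```python
from statistics import mean

# Per-dataset mean SDR values copied from the logs above.
speech_sdr = {
    "crema_d": -29.90983049070145,
    "fluent_speech_commands": 1.5112133717223253,
    "libri2Mix_test": -3.846337947640395,
    "librispeech": 1.0211239468849096,
    "quesst": -1.774289102870904,
    "snips_test_valid_subset": 3.7615257663215895,
    "vox_lingua_top10": -27.182861328199774,
    "voxceleb1": -1.9323934995843512,
}
audio_sdr = {
    "esc50": -22.282276880645814,
    "fsd50k": -21.45771079855064,
    "gunshot_triangulation": -22.950851389668035,
}

# Assumption inferred from the reported figure: crema_d is left out
# of the speech average.
speech_avg = mean(v for k, v in speech_sdr.items() if k != "crema_d")
speech_avg_all = mean(speech_sdr.values())
audio_avg = mean(audio_sdr.values())

print(f"speech SDR average (without crema_d): {speech_avg:.6f}")
print(f"speech SDR average (all eight logs):  {speech_avg_all:.6f}")
print(f"audio SDR average:                    {audio_avg:.6f}")
```

The same check can be repeated for Mel loss, STOI, and PESQ by swapping in the corresponding per-dataset values.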
If possible, could you follow section 4.2 of https://codecsuperb.github.io/Codec-SUPERB-rule.pdf to submit the inference instructions for your model?
We will present your SpeechTokenizer results on our hidden set in the SLT challenge. Please update your inference instructions as soon as possible. Thank you.
Best regards,
Codec-SUPERB team