Regarding Google Speech Command dataset
Hi, thanks for the great code. I've tried to reproduce the results. However, I found two confusing issues,
- The original dataset does not seem to include a `_silence_` folder; I didn't see `_silence_/721f767c_nohash_2.wav` listed in `test.txt`.
- The dataset has 64k samples, while the code uses only 29k. Why is that? Are the results in your paper produced from the data listed in the following files?
$ cat test.txt | wc -l
3081
$ cat train.txt | wc -l
22246
$ cat valid.txt | wc -l
3093
Looking forward to your reply,
Sincerely, Bo
Hi Bo,
- Since `_silence_` is literally empty audio, we don't need an actual wav file for it. If you check the data loader, it generates empty audio on-the-fly: https://github.com/hyperconnect/TC-ResNet/blob/8ccbff3a45590247d8c54cc82129acb90eecf5c8/datasets/audio_data_wrapper.py#L146-L174
- Would you double-check our instructions at https://github.com/hyperconnect/TC-ResNet/tree/master/speech_commands_dataset? You can find Google's original preprocessing code there, but as mentioned above, we slightly modified the split function. Although the original dataset contains 30 keywords, we select only 10 keywords as previous studies did, which is mentioned in the paper. That is why the number of selected wav samples is different.
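To illustrate the two points above, here is a minimal sketch (not the repo's actual loader; `load_sample`, `label_for`, and `TARGET_WORDS` are hypothetical names) of how `_silence_` can be synthesized without any wav file on disk, and how the 30 recorded words collapse to 10 keywords plus `_unknown_`:

```python
import numpy as np

SAMPLE_RATE = 16000  # Speech Commands clips are 1 second at 16 kHz

# The 10 target keywords commonly used in KWS benchmarks; all other
# recorded words are mapped to _unknown_, and _silence_ is synthesized.
TARGET_WORDS = {"yes", "no", "up", "down", "left", "right",
                "on", "off", "stop", "go"}

def label_for(path):
    """Map a wav path like 'bird/xxxx_nohash_0.wav' to its training label."""
    word = path.split("/")[0]
    if word == "_silence_":
        return "_silence_"
    return word if word in TARGET_WORDS else "_unknown_"

def load_sample(path, duration_sec=1.0):
    """Hypothetical loader: _silence_ entries never touch the disk;
    they are generated as all-zero audio on-the-fly."""
    if label_for(path) == "_silence_":
        return np.zeros(int(SAMPLE_RATE * duration_sec), dtype=np.float32)
    raise NotImplementedError("real keyword clips would be read from disk here")

silence = load_sample("_silence_/721f767c_nohash_2.wav")
print(len(silence))                               # 16000 zero samples
print(label_for("bird/0b40aa8e_nohash_0.wav"))    # _unknown_
```

So a `_silence_` path in `test.txt` acts as a placeholder label, not a file reference.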
@miautoml Have you solved your problem? I have met the same question and would like to learn how you solved it.
@justin-hpcnt Could I ask how to create the three '*.txt' files? I couldn't find instructions for this in https://github.com/hyperconnect/TC-ResNet/tree/master/speech_commands_dataset or in the papers.
- there is no '_silence_' folder
- using the hash of each filename to split into three files, I got the following counts (for the same wanted words), and they are sorted:
(base) ..T-Thread/WakeUp-Xiaorui/data❯ cat training_list.txt | wc -l
30769
(base) ..T-Thread/WakeUp-Xiaorui/data❯ cat validation_list.txt | wc -l
3703
(base) ..T-Thread/WakeUp-Xiaorui/data❯ cat testing_list.txt | wc -l
4074
Thanks a lot. I will read 'speech_commands_dataset/README' again.
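For reference, the hash-based split mentioned above follows the `which_set` function from Google's TensorFlow Speech Commands example (`input_data.py`); a lightly adapted sketch, with the percentages as assumed defaults:

```python
import hashlib
import re

MAX_NUM_WAVS_PER_CLASS = 2 ** 27 - 1  # ~134M; keeps the hash bucket range stable

def which_set(filename, validation_percentage=10.0, testing_percentage=10.0):
    """Deterministically assign a wav file to training/validation/testing.
    Everything after '_nohash_' is stripped before hashing, so multiple
    utterances from the same speaker always land in the same split."""
    base_name = filename.split("/")[-1]
    hash_name = re.sub(r"_nohash_.*$", "", base_name).encode("utf-8")
    hash_name_hashed = hashlib.sha1(hash_name).hexdigest()
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_WAVS_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_WAVS_PER_CLASS))
    if percentage_hash < validation_percentage:
        return "validation"
    elif percentage_hash < validation_percentage + testing_percentage:
        return "testing"
    return "training"

# Variants of the same speaker/utterance always get the same assignment:
print(which_set("yes/0a7c2a8d_nohash_0.wav") ==
      which_set("yes/0a7c2a8d_nohash_1.wav"))
```

The split is a pure function of the filename, so the three lists are reproducible, but different keyword/percentage choices (e.g. TC-ResNet's modified split) will still yield different line counts than the ones above.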