Guoguo Chen
I recently ran into the same issue for larger datasets. I was trying to use `--start-batch` from icefall to resume training, but it was loading the data for more...
@pzelasko Let me make sure that I understand this correctly. As a quick fix, Piotr, you are suggesting that when we resume from a certain checkpoint, we initialize `DynamicBucketingSampler` with...
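Something along these lines is what I have in mind. This is only a sketch of my understanding, assuming lhotse's `load_manifest_lazy` and the sampler's `state_dict()` / `load_state_dict()` checkpointing; the file names and parameter values below are placeholders, not anything from the actual recipe:

```python
import torch
from lhotse import load_manifest_lazy
from lhotse.dataset import DynamicBucketingSampler

# Open the manifest lazily so the huge cuts.jsonl.gz is never fully in memory.
cuts = load_manifest_lazy("cuts.jsonl.gz")

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=600.0,  # seconds of audio per mini-batch (placeholder value)
    num_buckets=30,
    shuffle=True,
)

# When writing a training checkpoint, also save the sampler's progress ...
torch.save(sampler.state_dict(), "sampler-state.pt")

# ... and when resuming, restore it so the sampler fast-forwards to where it
# stopped instead of replaying the batches it has already produced.
sampler.load_state_dict(torch.load("sampler-state.pt"))
```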
Thanks @pzelasko! I have a huge cuts.jsonl.gz file (hundreds of gigabytes), so I'll see how it works with `shuf`. I'll also take a look at sharding.
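For the sharding part, this is roughly what I'm planning to try, assuming lhotse's lazy manifests and the `CutSet.split_lazy` helper behave the way I expect; the output directory and chunk size are just placeholders:

```python
from lhotse import load_manifest_lazy

# Read the huge manifest lazily instead of loading hundreds of GB into memory.
cuts = load_manifest_lazy("cuts.jsonl.gz")

# Write shards of e.g. 100k cuts each into shards/ (cuts.000000.jsonl.gz, ...),
# so each worker or training run only has to scan the shards it actually needs.
cuts.split_lazy(output_dir="shards", chunk_size=100_000)
```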
@danpovey sorry, it's the unzipped manifest; it's sitting on disk at around 450G, and the zipped one would be somewhere around 40G. But this is just a portion of the...
Great, thanks! @pzelasko
I prefer option one, that is, to provide working scripts for each downstream toolkit. Here is what I have in mind: 1. Under each toolkit, we have a script to...
Thanks, we will get back to this.
Which download host was it (you should be able to see it from the logs)? Another person has asked about a similar issue. @wwfcnu
When you run the command `utils/download_gigaspeech.sh`, could you provide the host parameter, something like `utils/download_gigaspeech.sh --host speechocean`, and see if that is able to download the missing files? @wwfcnu
There could be issues with the MagicData server. I'm downloading from tsinghua to see if we hit the same issue. In the meantime, could you try `bash utils/download_gigaspeech.sh --host tsinghua...