Guoguo Chen
I recently ran into the same issue for larger datasets. I was trying to use `--start-batch` from icefall to resume training, but it was loading the data for more...
@pzelasko Let me make sure that I understand this correctly. As a quick fix, Piotr, you are suggesting that when we resume from a certain checkpoint, we initialize `DynamicBucketingSampler` with...
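Something along these lines is what I have in mind. This is only a sketch of my understanding, assuming lhotse's `load_manifest_lazy` and the sampler's `state_dict()` / `load_state_dict()` checkpointing; the file names and parameter values below are placeholders, not anything from the actual recipe:

```python
import torch
from lhotse import load_manifest_lazy
from lhotse.dataset import DynamicBucketingSampler

# Open the manifest lazily so the huge cuts.jsonl.gz is never fully in memory.
cuts = load_manifest_lazy("cuts.jsonl.gz")

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=600.0,  # seconds of audio per mini-batch (placeholder value)
    num_buckets=30,
    shuffle=True,
)

# When writing a training checkpoint, also save the sampler's progress ...
torch.save(sampler.state_dict(), "sampler-state.pt")

# ... and when resuming, restore it so the sampler fast-forwards to where it
# stopped instead of replaying the batches it has already produced.
sampler.load_state_dict(torch.load("sampler-state.pt"))
```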
Thanks @pzelasko! I have a huge cuts.jsonl.gz file (hundreds of gigabytes), so I'll see how it works with `shuf`. I'll also take a look at sharding.
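For the sharding part, this is roughly what I'm planning to try, assuming lhotse's lazy manifests and the `CutSet.split_lazy` helper behave the way I expect; the output directory and chunk size are just placeholders:

```python
from lhotse import load_manifest_lazy

# Read the huge manifest lazily instead of loading hundreds of GB into memory.
cuts = load_manifest_lazy("cuts.jsonl.gz")

# Write shards of e.g. 100k cuts each into shards/ (cuts.000000.jsonl.gz, ...),
# so each worker or training run only has to scan the shards it actually needs.
cuts.split_lazy(output_dir="shards", chunk_size=100_000)
```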
@danpovey sorry, it's the unzipped manifest; it's sitting on disk at around 450G, and the zipped one would be somewhere around 40G. But this is just a portion of the...
Great, thanks! @pzelasko
I prefer option one, that is, to provide working scripts for each downstream toolkit. Here is what I have in mind: 1. Under each toolkit, we have a script to...
Thanks, we will get back to this.
Which download host was it (you should be able to see it from the logs)? Another person has asked about a similar issue. @wwfcnu
When you run the command `utils/download_gigaspeech.sh`, could you provide the host parameter, something like `utils/download_gigaspeech.sh --host speechocean`, and see if that is able to download the missing files? @wwfcnu
There could be issues with the MagicData server. I'm downloading from tsinghua to see if we hit the same issue. In the meantime, could you try `bash utils/download_gigaspeech.sh --host tsinghua...