Guoguo Chen

Results: 12 comments by Guoguo Chen

I recently ran into the same issue for larger datasets. I was trying to use `--start-batch` from icefall to resume the training but it was loading the data for more...

@pzelasko Let me make sure that I understand this correctly. As a quick fix, Piotr, you are suggesting that when we resume from a certain checkpoint, we initialize `DynamicBucketingSampler` with...

Thanks @pzelasko! I have a huge cuts.jsonl.gz file (hundreds of gigabytes), so I'll see how it works with `shuf`. I'll also take a look at sharding.
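
For readers wondering what sharding a large manifest might look like, here is a minimal sketch using only the Python standard library. It is not the approach from the thread and it does not use any Lhotse API: the function name `shard_jsonl_gz`, the shard file naming pattern, and the `lines_per_shard` parameter are all hypothetical, chosen for illustration. It streams the gzipped JSONL file line by line, so the whole manifest never has to fit in memory, and it does not shuffle anything.

```python
import gzip
from pathlib import Path


def shard_jsonl_gz(src, out_dir, lines_per_shard):
    """Split a gzipped JSONL file into gzipped shards of at most
    `lines_per_shard` lines each, preserving the original line order.

    Hypothetical helper for illustration; the shard naming is an assumption.
    """
    if lines_per_shard <= 0:
        raise ValueError("lines_per_shard must be positive")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    shard_paths = []
    out = None
    try:
        with gzip.open(src, "rt", encoding="utf-8") as f:
            for i, line in enumerate(f):
                # Start a new shard every `lines_per_shard` lines.
                if i % lines_per_shard == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"cuts.{len(shard_paths):06d}.jsonl.gz"
                    out = gzip.open(path, "wt", encoding="utf-8")
                    shard_paths.append(path)
                out.write(line)
    finally:
        # Close the last open shard even if reading fails midway.
        if out is not None:
            out.close()
    return shard_paths
```

Calling `shard_jsonl_gz("cuts.jsonl.gz", "shards", 100000)` would write files such as `shards/cuts.000000.jsonl.gz`. Smaller shards can then be shuffled or loaded independently, which is the usual motivation for sharding a manifest of this size.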

@danpovey Sorry, it's the unzipped manifest. It's sitting on the disk at around 450G, and the zipped one would be somewhere around 40G. But this is just a portion of the...

I prefer option one, that is to provide working scripts for each downstream toolkit. Here is what I have in mind: 1. Under each toolkit, we have a script to...

Thanks, we will get back to this.

Which downloading host was it (you should be able to see it from the logs)? I got another person asking about a similar issue. @wwfcnu

When you run the command `utils/download_gigaspeech.sh`, could you provide the host parameter, something like `utils/download_gigaspeech.sh --host speechocean`, and see if that downloads the missing files? @wwfcnu

There could be issues with the MagicData server. I'm downloading from tsinghua to see if we have the same issue. In the meantime, could you try `bash utils/download_gigaspeech.sh --host tsinghua...