[BERT/PyTorch] How can we use create_datasets_from_start.sh for BERT pretraining?
Related to Model/Framework(s) (e.g. GNMT/PyTorch or FasterTransformer/All)
BERT/PyTorch
In the README.md of https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT it is mentioned that running create_datasets_from_start.sh will generate the pretraining dataset for BERT. However, whenever I tried to run the given shell script, it resulted in an error: `download_wikipedia: command not found`. I believe this is happening because lddl has been moved to a different repo here: https://github.com/NVIDIA/LDDL. If that is the reason, what are the steps I need to take in order to generate a pretraining dataset? Also, do we need sudo privileges to run lddl on a Slurm cluster? We are using Slurm and I don't have any sudo privileges; if lddl requires sudo privileges, are there any alternatives to using lddl?
The script assumes that certain dependencies, including the `download_wikipedia` command provided by lddl, are already available in your environment. Since lddl has moved into its own repository, it is no longer pulled in automatically, which is why the command is not found.
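If that is the cause, one workaround that does not require sudo is to install lddl into a user-owned Python environment before running the script. Below is a minimal sketch, assuming lddl is pip-installable directly from the new repository; check the LDDL README for the current install instructions and entry points:

```bash
# Create an isolated Python environment under your home directory -- no sudo needed.
python3 -m venv ~/lddl-env
source ~/lddl-env/bin/activate

# Install lddl from its new repository.
# Assumption: the repo is installable from source with pip;
# see https://github.com/NVIDIA/LDDL for the current instructions.
pip install git+https://github.com/NVIDIA/LDDL.git

# Sanity check: the console script the error complained about should now be on PATH.
download_wikipedia --help

# With the environment active, re-run the original script so it can find the command.
bash create_datasets_from_start.sh
```

Because everything is installed under your home directory, no root access is involved; on a Slurm cluster you would activate the same environment inside your sbatch or srun job script before launching the download and preprocessing steps.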