
[ELECTRA/TF2] Option To Allow scripts/run_pretraining.sh To Use "wiki_only"

Open psharpe99 opened this issue 2 years ago • 0 comments

Related to ELECTRA/TF2

Is your feature request related to a problem? Please describe. The README shows that the datasets can be created from wiki-only data:

    /workspace/electra/data/create_datasets_from_start.sh wiki_only

However, when you then continue to pretraining with the README instruction

    bash scripts/run_pretraining.sh

the script complains that the file/directory does not exist. Looking at run_pretraining.sh, it contains:

    DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/books_wiki_en_corpus/train/pretrain_data*" # change this for other datasets
    DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/books_wiki_en_corpus/train/pretrain_data*" # change this for other datasets

Both variables are preset to the books_wiki directory, and the comment says they must be changed manually for other datasets (e.g. wiki-only). Changing them by hand to the 'wikicorpus_en' directory allowed the pretraining to succeed, but ideally the script shouldn't need editing.

Describe the solution you'd like It should be a simple change to add a command-line option to the run_pretraining.sh script that selects the "wiki_only" dataset.
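As a rough illustration of the requested change, here is a minimal sketch of how the dataset paths could be derived from a single dataset keyword instead of being hard-coded. The helper names `dataset_dir_for` and `dataset_glob_for` and the `DATASET_NAME` variable are hypothetical and not part of the existing script. The sketch also assumes the wiki-only TFRecords follow the same directory layout as the books_wiki ones, with only the corpus directory name differing.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers do not exist in run_pretraining.sh.

# Map a dataset keyword to the corpus directory name.
# Both directory names are the ones mentioned in this issue.
dataset_dir_for() {
  case "$1" in
    wiki_books) echo "books_wiki_en_corpus" ;;
    wiki_only)  echo "wikicorpus_en" ;;
    *) echo "unknown dataset: $1" >&2; return 1 ;;
  esac
}

# Build the TFRecord glob for a dataset keyword ($1) and sequence length ($2).
# Assumes the wiki-only layout mirrors the books_wiki layout.
dataset_glob_for() {
  local corpus
  corpus="$(dataset_dir_for "$1")" || return 1
  echo "tfrecord_lower_case_1_seq_len_${2}_random_seed_12345/${corpus}/train/pretrain_data*"
}

# Hypothetical selector; defaulting to wiki_books keeps the current behaviour.
DATASET_NAME="${DATASET_NAME:-wiki_books}"
DATASET_P1="$(dataset_glob_for "$DATASET_NAME" 128)"
DATASET_P2="$(dataset_glob_for "$DATASET_NAME" 512)"
```

With something like this in place, a wiki-only run might be started as `DATASET_NAME=wiki_only bash scripts/run_pretraining.sh`. Whether the value should arrive through an environment variable or a positional argument is left open here.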

Describe alternatives you've considered Alternatively, the README should document that this script needs to be edited when pretraining on wiki-only data.

Additional context none

psharpe99 • Jun 30 '23 10:06