
Converting text corpora to HDF5 format

Open · Remorax opened this issue 4 years ago · 1 comment

Hello,

Thank you for providing this wonderful repository; it is truly interesting and will be very helpful to me in my university experiments.

However, could you please let me know how to convert a new text corpus to the HDF5 format expected by your code? Specifically, I would like to know how to generate:

  1. hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5 and hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5
  2. test_128 and test_512
  3. uncased_L-12_H-768_A-12 and uncased_L-24_H-1024_A-16

If any further details are required, please let me know. I look forward to hearing from you soon.

Remorax · Jul 19 '21 17:07

Hey,

Thanks for raising the question. Basically, hetseq generates the HDF5 files following the logic provided by NVIDIA at https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh, applied to a downloaded Wikipedia dump. You may need to adapt that code for your own corpus.
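For reference, those scripts write each shard as a set of fixed-size integer arrays in one HDF5 file. Below is a minimal sketch of that layout using h5py; the dataset names follow NVIDIA's create_pretraining_data.py, and the dtypes are an assumption, so please verify both against the files the script actually produces:

```python
# Sketch: write one pre-training shard in the NVIDIA-style BERT HDF5 layout.
# Assumes the instances are already tokenized, masked, and padded to
# max_seq_len / max_pred; the zero arrays are placeholders for real data.
import numpy as np
import h5py

num_instances, max_seq_len, max_pred = 1000, 128, 20

with h5py.File("example_shard.hdf5", "w") as f:
    f.create_dataset("input_ids", data=np.zeros((num_instances, max_seq_len), dtype="int32"))
    f.create_dataset("input_mask", data=np.zeros((num_instances, max_seq_len), dtype="int32"))
    f.create_dataset("segment_ids", data=np.zeros((num_instances, max_seq_len), dtype="int32"))
    f.create_dataset("masked_lm_positions", data=np.zeros((num_instances, max_pred), dtype="int32"))
    f.create_dataset("masked_lm_ids", data=np.zeros((num_instances, max_pred), dtype="int32"))
    f.create_dataset("next_sentence_labels", data=np.zeros((num_instances,), dtype="int32"))
```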

  1. "hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5 and hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5" are basically two formats for BERT phase1 and phase2. These two are generated from wikipedia by me, you can just use it to run BERT with hetseq.
  2. "test_128 and test_512" are just subset of 1 to do debugging and fast running.
  3. "uncased_L-12_H-768_A-12 and uncased_L-24_H-1024_A-16" are just some tensorflow ckpt, the only thing we require is the "vocab" to transform "tokens" into their "input_ids".

Please let me know if this answers your question or if you need any further help.

yifding · Jul 20 '21 02:07