Megatron-DeepSpeed
Preprocessing from arrow file to load an HF dataset
This adds an option to launch preprocessing from an HF dataset rather than only from jsonlines. For now the dataset is loaded from an Arrow file, as that is the use case on JZ.
Update: with the change from column format to row format, this version is ~4 times slower than a hacky version that just reads the column for a given feature. The hacky version is included as _hack.py.
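To illustrate the difference between the two access patterns, here is a minimal pure-Python sketch. It is not the code from this PR: the `Table` type, the function names, and the toy data are all hypothetical, and a dict of lists merely stands in for Arrow's columnar layout. With the `datasets` library, the analogous operations would roughly be iterating over the dataset (one dict per row) versus indexing it by feature name (one whole column).

```python
from typing import Dict, Iterator, List

# Hypothetical toy table stored column-wise, standing in for an Arrow table.
Table = Dict[str, List[str]]


def iter_rows(table: Table) -> Iterator[Dict[str, str]]:
    """Row-oriented access: build one dict per example, touching every column."""
    num_rows = len(next(iter(table.values()))) if table else 0
    for i in range(num_rows):
        yield {name: column[i] for name, column in table.items()}


def texts_via_rows(table: Table, feature: str) -> List[str]:
    """Collect one feature by materializing every row first."""
    return [row[feature] for row in iter_rows(table)]


def texts_via_column(table: Table, feature: str) -> List[str]:
    """Column-oriented access: read the single column that is needed."""
    return table[feature]


# Toy data (assumed for illustration only).
table: Table = {
    "text": ["first doc", "second doc", "third doc"],
    "meta": ["a", "b", "c"],
}

# Both paths return the same values; only the amount of work differs.
assert texts_via_rows(table, "text") == texts_via_column(table, "text")
```

The row path pays for building a per-example record that includes features the tokenizer never reads, which is one plausible source of the slowdown noted above. The column path skips that work but is tied to a single feature.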