Preprocessing from arrow file to load an HF dataset

Open • TevenLeScao opened this issue 3 years ago • 1 comment

This adds an option to launch preprocessing from an HF dataset rather than only from jsonlines. For now the dataset is loaded from an arrow file, as that's the use case on JZ.

TevenLeScao • Mar 11 '22 05:03

Update: this version is ~4 times slower than a hacky version that just gets the column for a certain feature, because of the change from column format to row format. I'm including the hacky version as _hack.py.
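To make the column-versus-row difference concrete, here is a minimal pure-Python sketch. It is not the code from this change and does not use the `datasets` library: the `columns` table, `iter_rows`, and `get_column` are hypothetical names that only model the two access patterns. Row access builds a dict containing every feature for each example, while column access returns the single stored feature directly, which is presumably where the extra cost comes from.

```python
# Toy columnar table: one Python list per feature (hypothetical data,
# standing in for an arrow-backed dataset).
columns = {
    "text": ["first document", "second document", "third document"],
    "meta": [{"id": 0}, {"id": 1}, {"id": 2}],
}


def iter_rows(table):
    """Row format: build one dict per example, touching every feature."""
    num_rows = len(next(iter(table.values())))
    for i in range(num_rows):
        yield {name: values[i] for name, values in table.items()}


def get_column(table, feature):
    """Column format: return the stored values of a single feature."""
    return table[feature]


# Both paths yield the same texts; the row path does extra work per example
# because it also materializes the unused "meta" feature.
texts_from_rows = [row["text"] for row in iter_rows(columns)]
texts_from_column = list(get_column(columns, "text"))
```

The two lists are identical, so a tokenization step that only needs one feature gets the same input either way. The trade-off is that the row path generalizes to preprocessing that needs several features per example, whereas the column shortcut only works when a single feature is enough.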

TevenLeScao • Mar 11 '22 06:03