Megatron-DeepSpeed
Add valid data
As requested by @TevenLeScao
We want to:
1. train on a mix of languages
2. do validation on English-only
By default, Megatron-DeepSpeed carves the validation set out of the training data as a fraction of it, so at the moment we can't combine multilingual training data with English-only validation data. To launch experiments, we just need a quick hack that lets us point validation at a separate English-only dataset.
- [x] Add additional argument for valid data
- [x] Implement valid data-loader
- [x] Run a dummy test
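For reviewers, here is a rough sketch of the idea behind the checklist above, not the code in this pull. It assumes Megatron-style `--data-path` and `--split` arguments; the `--valid-data-path` flag and the `plan_datasets` helper are hypothetical names used only for illustration. When a separate validation path is given, the validation share of the split is dropped and validation reads entirely from the new path.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch: separate validation data")
    parser.add_argument("--data-path", nargs="+", required=True,
                        help="Training corpus (for example a multilingual mix)")
    parser.add_argument("--split", default="969,30,1",
                        help="Comma-separated train,valid,test weights")
    # Hypothetical flag for illustration; the real argument name may differ.
    parser.add_argument("--valid-data-path", nargs="+", default=None,
                        help="Optional separate validation corpus (for example English-only)")
    return parser


def plan_datasets(args):
    """Return {split_name: (paths, fraction_of_those_paths)} for each split."""
    weights = [float(w) for w in args.split.split(",")]
    if len(weights) != 3:
        raise ValueError("--split expects exactly three weights: train,valid,test")

    use_separate_valid = args.valid_data_path is not None
    if use_separate_valid:
        # Validation no longer comes out of the training corpus,
        # so its share of the split is handed back to train/test.
        weights[1] = 0.0

    total = sum(weights)
    if total <= 0:
        raise ValueError("--split weights must sum to a positive number")
    train_frac, valid_frac, test_frac = (w / total for w in weights)

    if use_separate_valid:
        valid_plan = (args.valid_data_path, 1.0)  # use the whole separate corpus
    else:
        valid_plan = (args.data_path, valid_frac)  # default: slice of training data

    return {
        "train": (args.data_path, train_frac),
        "valid": valid_plan,
        "test": (args.data_path, test_frac),
    }


if __name__ == "__main__":
    # Placeholder paths, purely illustrative.
    demo = build_parser().parse_args(
        ["--data-path", "multilingual_corpus", "--valid-data-path", "english_corpus"]
    )
    for name, (paths, fraction) in plan_datasets(demo).items():
        print(name, paths, round(fraction, 4))
```

Falling back to the existing split when the flag is omitted keeps current launch scripts working unchanged.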
@TevenLeScao Did you get a chance to take a look at this pull request?
Hey Maruf, sorry, not yet. I'm a bit swamped at the moment, and the priority has switched to doing some additional cleaning of OSCAR-ml ourselves before launching anything on it. Maybe @ibeltagy can review in the meantime?