fms-fsdp
Various dataloader updates and fixes
A collection of dataloader updates and fixes mirrored from the torchtitan repo. Changes include:
FIXES FOR HANGS AND FREEZES:
- Truncate long text docs to 1M characters
- Allow LCG to advance after reaching an empty doc
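The two hang fixes above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `truncate_doc`, `lcg_order`, `iter_docs`, and the LCG constants are hypothetical names and values chosen for the example. The point is that the LCG state advances before the empty-doc check, so an empty doc cannot stall the walk.

```python
from typing import Iterator, List

MAX_DOC_CHARS = 1_000_000  # the 1M-character cap from the PR description


def truncate_doc(text: str, max_chars: int = MAX_DOC_CHARS) -> str:
    """Cap a document at max_chars characters."""
    return text[:max_chars]


def lcg_order(n_docs: int, seed: int = 0) -> Iterator[int]:
    """Visit every index in [0, n_docs) exactly once in LCG order.

    Hypothetical constants: modulus m is the next power of two >= n_docs,
    with multiplier 5 and increment 1, which gives a full-period LCG.
    States >= n_docs are skipped.
    """
    if n_docs <= 0:
        return
    m = 1
    while m < n_docs:
        m *= 2
    state = seed % m
    for _ in range(m):
        state = (5 * state + 1) % m  # always advance, whatever the doc holds
        if state < n_docs:
            yield state


def iter_docs(docs: List[str], seed: int = 0) -> Iterator[str]:
    """Yield truncated, non-empty docs in LCG order."""
    for idx in lcg_order(len(docs), seed):
        text = truncate_doc(docs[idx])
        if not text:
            continue  # the LCG has already moved on, so no hang here
        yield text
```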
FIXES FOR CRASHES:
- Check dataset paths for typos rather than returning empty subdatasets
- Handle edge case where rescaling upward can cause an epoch to finish and the resulting data shard to appear empty
- Every worker now owns at least one doc from each owned file, unless the file is empty (preventing empty shards on small datasets)
- Skip empty data files
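As a rough illustration of the path check and the empty-file skip, here is a minimal sketch. The helper name `list_nonempty_files` and the flat directory layout are assumptions for this example and are not taken from the PR. A mistyped path raises immediately instead of silently producing an empty subdataset.

```python
import os
from typing import List


def list_nonempty_files(dataset_dir: str) -> List[str]:
    """Return the non-empty files under dataset_dir, in sorted order.

    Hypothetical helper: fails loudly on a bad path and drops
    zero-byte files from the result.
    """
    if not os.path.isdir(dataset_dir):
        raise FileNotFoundError(
            f"Dataset path does not exist or is not a directory: {dataset_dir}"
        )
    files = []
    for name in sorted(os.listdir(dataset_dir)):
        path = os.path.join(dataset_dir, name)
        # Skip subdirectories and zero-byte files.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            files.append(path)
    return files
```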
FIXES FOR CONVERGENCE BEHAVIOR:
- File sharding now accounts for file size (BREAKS BACKWARD COMPATIBILITY)
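One way to make file sharding account for file size is a greedy largest-first assignment, sketched below. This is not necessarily the scheme used in this PR. The function name `shard_files_by_size` and the choice of whole-file assignment are assumptions made for illustration. It also shows why such a change breaks backward compatibility: the file-to-worker mapping differs from a size-blind split, so older checkpoints would point at different data.

```python
import heapq
from typing import Dict, List


def shard_files_by_size(file_sizes: Dict[str, int], n_workers: int) -> List[List[str]]:
    """Assign files to workers so that total bytes per worker are roughly even.

    Hypothetical sketch: the largest remaining file always goes to the
    currently least-loaded worker.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    # Min-heap of (bytes assigned so far, worker index).
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(n_workers)]
    # Largest first; ties broken by name so the result is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        shards[worker].append(name)
        heapq.heappush(heap, (load + size, worker))
    return shards
```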
QUALITY OF LIFE:
- Add support for FIM training (default off, enable by setting cfg.spm_rate or cfg.psm_rate to nonzero values). Precludes #125
- Add support for multiple column names when reading from data files
- Add manual alert message for completing an epoch without returning any data
- Allow the user to specify how many document chunks to yield before manually inserting EOS and breaking the doc. This does not truncate the doc: the loader returns to it on the next iter() call.
- More informative error reporting when a set of empty shards is detected
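For readers unfamiliar with FIM (fill-in-the-middle), the sketch below shows one common way such a transform is applied per document. Only the names `spm_rate` and `psm_rate` come from this PR. The function `apply_fim`, the sentinel strings, the split-point sampling, and the reading of the two rates as independent probabilities are all assumptions for illustration.

```python
import random
from typing import List


def apply_fim(
    tokens: List[str],
    spm_rate: float,
    psm_rate: float,
    rng: random.Random,
    pre: str = "<PRE>",  # hypothetical sentinel tokens
    suf: str = "<SUF>",
    mid: str = "<MID>",
) -> List[str]:
    """Rearrange a doc into PSM or SPM order, or leave it unchanged.

    Assumed semantics: PSM with probability psm_rate, SPM with
    probability spm_rate, untouched otherwise.
    """
    r = rng.random()
    if r >= psm_rate + spm_rate or len(tokens) < 2:
        return list(tokens)
    # Pick two distinct cut points to split into prefix / middle / suffix.
    lo, hi = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    if r < psm_rate:
        # PSM: prefix, then suffix, then the middle to be predicted.
        return [pre] + prefix + [suf] + suffix + [mid] + middle
    # SPM variant: suffix first, then prefix and middle together.
    return [pre, suf] + suffix + [mid] + prefix + middle
```

With both rates at zero the document passes through unchanged, which matches the default-off behavior described above.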