Davis Wertheimer
The current dataloader still causes gradual, asymptotic slowdowns, likely because `n_workers` is fixed to 0 in the dataloader. This forces the main process to also handle dataloading in a...
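As a rough illustration of the suspected issue (not the PR's actual fix), the sketch below contrasts `num_workers=0`, where the training process performs all loading and preprocessing inline, with offloading that work to background worker processes; the dataset and parameter values are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical dataset standing in for the real streaming dataset."""
    def __init__(self, n=10_000):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        # Stand-in for preprocessing that would otherwise run on the main process.
        return torch.randn(1024)

# num_workers=0: the training loop blocks on loading/preprocessing every step.
blocking_loader = DataLoader(ToyDataset(), batch_size=8, num_workers=0)

# num_workers>0: loading runs in worker processes and overlaps with compute.
overlapped_loader = DataLoader(
    ToyDataset(),
    batch_size=8,
    num_workers=4,            # offload preprocessing to 4 worker processes
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2,        # each worker keeps 2 batches ready
)
```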
Add support for speculator training, piggybacking off the existing training utilities. Training script and speculator-specific utilities are inside the new `speculator` subfolder. Uses distributed setup, checkpointing, and dataloaders from this...
This PR introduces an experimental PyTorch-native dataloader from IBM that is distributed, stateful, checkpointable, composable and rescalable. It is intended for use in large-scale model pretraining, particularly in research settings...
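A minimal sketch of the composability and statefulness ideas, using hypothetical class names rather than the PR's actual pipeline stages: each stage wraps an inner dataset, contributes its own state, and the whole pipeline checkpoints as one nested dict.

```python
import random

class StatefulStage:
    """Hypothetical pipeline stage: wraps an inner iterable and exposes
    checkpointable state, so stages compose freely."""
    def __init__(self, inner=None):
        self.inner = inner

    def state_dict(self):
        inner_state = self.inner.state_dict() if isinstance(self.inner, StatefulStage) else {}
        return {"inner": inner_state, **self._local_state()}

    def load_state_dict(self, state):
        if isinstance(self.inner, StatefulStage):
            self.inner.load_state_dict(state["inner"])
        self._load_local_state(state)

    def _local_state(self):
        return {}

    def _load_local_state(self, state):
        pass

class CountingSource(StatefulStage):
    """Toy source stage: yields integers and remembers where it left off."""
    def __init__(self):
        super().__init__()
        self.cursor = 0

    def __iter__(self):
        while True:
            yield self.cursor
            self.cursor += 1

    def _local_state(self):
        return {"cursor": self.cursor}

    def _load_local_state(self, state):
        self.cursor = state["cursor"]

class ShuffleBuffer(StatefulStage):
    """Toy shuffle stage: its in-flight buffer is part of the checkpoint,
    so resuming neither drops nor repeats documents."""
    def __init__(self, inner, buffer_size=4, seed=0):
        super().__init__(inner)
        self.buffer, self.buffer_size = [], buffer_size
        self.rng = random.Random(seed)

    def __iter__(self):
        for item in self.inner:
            self.buffer.append(item)
            if len(self.buffer) >= self.buffer_size:
                yield self.buffer.pop(self.rng.randrange(len(self.buffer)))

    def _local_state(self):
        return {"buffer": list(self.buffer)}

    def _load_local_state(self, state):
        self.buffer = list(state["buffer"])

pipeline = ShuffleBuffer(CountingSource())
```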
Implement [muP scaling](https://arxiv.org/abs/2203.03466) for Llama models. Model follows muP scaling laws but introduces the minimal set of extra tunable hyperparameters that allows us to recover prior behavior - thus may...
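For context, a minimal sketch of the kinds of width-dependent multipliers muP introduces; the helper and hyperparameter names below are illustrative, not the PR's.

```python
import math

def mup_multipliers(d_model, base_d_model=256, head_dim=64):
    """Hypothetical helper computing muP-style multipliers relative to a base width."""
    m_width = d_model / base_d_model
    return {
        # Attention logits scale as 1/d_head under muP, instead of 1/sqrt(d_head).
        "attn_scale": 1.0 / head_dim,
        # Output (unembedding) logits are damped by the width multiplier.
        "output_mult": 1.0 / m_width,
        # Under Adam, hidden-weight learning rates shrink with width.
        "hidden_lr_scale": 1.0 / m_width,
        # Hidden-weight init std shrinks with sqrt(width) relative to the base model.
        "hidden_init_scale": 1.0 / math.sqrt(m_width),
    }
```

Presumably, "recover prior behavior" means that leaving the extra hyperparameters at their standard-parameterization defaults trains the model exactly as before; the sketch only illustrates typical muP-mode values.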
Current code prints multiple warnings from each GPU at the start of training, which clutters up the log. Updates the dataloader and process group constructors to eliminate these warnings, respectively: ...
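The exact warnings are truncated above; purely as an illustration (not necessarily what this PR does), one common pattern for quieting per-rank startup noise is to bind the process group to the local device and only surface library warnings on rank 0.

```python
import os
import warnings

import torch
import torch.distributed as dist

# Assumes a torchrun-style launch where LOCAL_RANK is set.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# On recent PyTorch (the `device_id` argument is assumed available, ~2.3+),
# binding the process group to the local device at construction time avoids
# some per-rank device-mapping warnings.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
)

# Surface library warnings only on rank 0 so the startup log stays readable.
if dist.get_rank() != 0:
    warnings.filterwarnings("ignore")
```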
A collection of dataloader updates and fixes mirrored from the torchtitan repo. Changes include:

FIXES FOR HANGS AND FREEZES:
- Truncate long text docs to 1M characters (see the sketch after this list)
- Allow LCG...
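A minimal sketch of the first fix listed above, with a hypothetical helper name:

```python
MAX_DOC_CHARS = 1_000_000  # documents longer than this are truncated before tokenization

def truncate_doc(text: str, max_chars: int = MAX_DOC_CHARS) -> str:
    """Hypothetical helper: cap raw document length so a single pathological
    document cannot stall a tokenizer worker and freeze the pipeline."""
    return text if len(text) <= max_chars else text[:max_chars]
```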
When the dataloader loads from checkpoint, it expects a path to the checkpoints directory, from which it pulls the most recent checkpoint folder and loads the relevant data. This is...
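A minimal sketch of the "most recent checkpoint folder" lookup this behavior implies; the directory layout and `step_<n>` naming scheme are assumptions, not necessarily what the repo uses.

```python
import os
import re

def latest_checkpoint(ckpt_dir: str):
    """Hypothetical helper: pick the checkpoint subfolder with the highest step number."""
    pattern = re.compile(r"step_(\d+)$")  # assumed naming scheme, e.g. step_1000/
    candidates = []
    for name in os.listdir(ckpt_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(ckpt_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    _, newest = max(candidates)
    return os.path.join(ckpt_dir, newest)
```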
Adds support for FIM training (https://arxiv.org/pdf/2207.14255). Allows for SPM or PSM mode (or both) with `--fim_training` command arg. Passes unit tests but not yet tested with a small LLM. Will...
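For reference, a rough sketch of the PSM (prefix-suffix-middle) transform described in the paper; the sentinel strings and helper name are placeholders, and the real implementation operates on tokenizer-specific ids.

```python
import random

# Placeholder sentinel tokens; the real ids depend on the tokenizer.
PRE, MID, SUF = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"

def apply_fim_psm(tokens, rng=random):
    """PSM sketch: split the document at two random points, then emit
    <fim_prefix> prefix <fim_suffix> suffix <fim_middle> middle,
    so the model learns to generate the middle span last."""
    if len(tokens) < 2:
        return list(tokens)
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [PRE, *prefix, SUF, *suffix, MID, *middle]
```

SPM mode instead places the suffix segment ahead of the prefix (the paper discusses the exact sentinel placement); when both modes are enabled, the transform is typically chosen per document.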
Implements rescaling of checkpoints to different world sizes and numbers of workers. The user specifies the number of data partitions in advance, and when saving/loading checkpoints with a different total number of workers, stateful...
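A minimal sketch of the rescaling idea as described: the number of logical data partitions is fixed up front, each physical worker owns whichever partitions map to it under the current world size, and partition-level state can therefore be regrouped when the worker count changes. Names and the round-robin mapping are illustrative.

```python
def partitions_for_worker(worker_id: int, n_workers: int, n_partitions: int):
    """Hypothetical assignment of fixed logical partitions to workers.
    Because state is saved per partition, a checkpoint written with one
    worker count can be regrouped and loaded with another, as long as
    n_partitions stays the same."""
    assert n_partitions % n_workers == 0, "partitions should divide evenly across workers"
    return [p for p in range(n_partitions) if p % n_workers == worker_id]

# Example: 24 fixed partitions, saved with 8 total workers, reloaded with 6.
saved = {w: partitions_for_worker(w, 8, 24) for w in range(8)}
reloaded = {w: partitions_for_worker(w, 6, 24) for w in range(6)}
```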