fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.

Results 40 fms-fsdp issues

Current dataloader still causes gradual asymptotic slowdowns - likely because we have n_workers fixed to 0 in the dataloader. This forces the main process to also handle dataloading in a...

bug
enhancement
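To illustrate the idea behind the dataloader issue above (moving batch preparation off the training loop's critical path), here is a minimal sketch using only the Python standard library. It is not the repository's dataloader: `prefetch`, `slow_batches`, and `run` are hypothetical names made up for this example, and a background thread stands in for worker processes. In PyTorch, the analogous knob is the `num_workers` argument of `torch.utils.data.DataLoader`, which uses separate processes and so also sidesteps the GIL for CPU-heavy loading.

```python
import queue
import threading
import time

_SENTINEL = object()  # marks the end of the stream


def prefetch(batches, buffer_size=4):
    """Yield items from `batches`, producing them on a background thread.

    Up to `buffer_size` items are prepared ahead of the consumer, so the
    consumer only blocks when the buffer is empty.
    """
    buf = queue.Queue(maxsize=buffer_size)

    def worker():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            buf.put(_SENTINEL)  # always signal completion

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _SENTINEL:
            return
        yield item


def slow_batches(n, delay=0.01):
    """Hypothetical data source: each batch takes `delay` seconds to build."""
    for i in range(n):
        time.sleep(delay)
        yield [i] * 4


def run(n=5):
    """Stand-in for a training loop: consume prefetched batches in order."""
    return [sum(batch) for batch in prefetch(slow_batches(n))]
```

Batches arrive in their original order, and the consumer loop never prepares data itself. Whether this kind of offloading removes the slowdown described in the issue is not something the sketch can confirm.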

The InstructLab backend currently focuses on Mistral fine-tuning, and I'm trying to maximize throughput for that. If anyone notices anything obvious or has any suggestions, I'd truly appreciate it. @raghukiran1224...

Add support for speculator training, piggybacking off the existing training utilities. Training script and speculator-specific utilities are inside the new `speculator` subfolder. Uses distributed setup, checkpointing, and dataloaders from this...

speculator training

We have been noticing a training slowdown introduced by our dataloader. Upon further checking, we traced the issue to the fact that our dataset class is...

While training a speculator using the specu-train branch, I am getting an OOM error when trying to load a checkpoint in HuggingFace format. The model_type is "gpt_megatron". The script works fine for...

speculator training

## Scope This write-up only applies to "initial model init". For cases that require loading a checkpoint (continue-pretraining, fine-tuning and inference), this is not needed as any init would be...

We recently added a commit that raises the Dynamo accumulated cache size limit so that compile works with large models like 70b, whose num_layer is greater than the default limit (64): https://github.com/foundation-model-stack/fms-fsdp/pull/45#issuecomment-2002564455....

Add a flop counter to the code, controlled by a bool flag. It is already available in the flop_counter branch but will require some extra work to prettify it and integrate...

enhancement
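As a rough sketch of what a flag-gated flop counter could look like, the example below reports throughput and, only when a bool flag is set, an estimated FLOP rate. This is not the code in the flop_counter branch: `estimate_train_flops`, `step_report`, and the `report_flops` flag are hypothetical names, and the estimate uses the common analytic approximation of 6 FLOPs per parameter per token for a forward plus backward pass, which ignores attention-specific terms.

```python
def estimate_train_flops(n_params, n_tokens):
    """Approximate training FLOPs as 6 * parameters * tokens.

    Assumption: the usual rule of thumb of 2 FLOPs per parameter per token
    for the forward pass and 4 for the backward pass.
    """
    return 6 * n_params * n_tokens


def step_report(step, n_params, tokens_per_step, step_seconds, report_flops=False):
    """Build a log line for one training step.

    The FLOP estimate is only computed and appended when `report_flops`
    is True, so the default output is unchanged.
    """
    msg = f"step {step}: {tokens_per_step / step_seconds:.0f} tokens/s"
    if report_flops:
        flops = estimate_train_flops(n_params, tokens_per_step)
        tflops_per_s = flops / step_seconds / 1e12
        msg += f", {tflops_per_s:.1f} TFLOP/s"
    return msg
```

For example, `step_report(1, 7_000_000_000, 1_000_000, 2.0, report_flops=True)` returns `"step 1: 500000 tokens/s, 21000.0 TFLOP/s"`, where the figure is aggregated over all devices taking part in the step. A measured counter (rather than an analytic estimate) would give different numbers.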

The latest changes in 0.0.6 (https://github.com/foundation-model-stack/foundation-model-stack/commit/eccd6028cec75f84ce1834a3a18649f5d8fc0641) break the [model conversion code](https://github.com/foundation-model-stack/fms-fsdp/blob/starcoder/fms_to_hf.py). I am switching back to the foundation-model-stack commit d04def43e9eb8a4e0adf7285c59dd66274e1b724, which still works.