fms-fsdp issues

Enable asynchronous dataloading

3

Current dataloader still causes gradual asymptotic slowdowns - likely because we have n_workers fixed to 0 in the dataloader. This forces the main process to also handle dataloading in a...

daviswer

bug

enhancement
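
The asynchronous-dataloading issue above attributes the slowdown to data loading running in the main process. As a rough illustration of the general idea only, the following stdlib sketch overlaps batch production with consumption using a background thread and a bounded queue. It is not the fms-fsdp dataloader and not PyTorch code; `slow_batches`, `prefetch`, and `train` are hypothetical names invented for this example.

```python
import queue
import threading
import time


def slow_batches(n, delay=0.01):
    """Hypothetical data source: each batch takes `delay` seconds to produce."""
    for i in range(n):
        time.sleep(delay)  # stands in for disk reads / tokenization
        yield i


def prefetch(iterable, buffer_size=4):
    """Yield items from `iterable`, produced ahead of time by a background thread."""
    q = queue.Queue(maxsize=buffer_size)  # bounded, so the producer cannot run away
    sentinel = object()                   # marks the end of the stream

    def worker():
        try:
            for item in iterable:
                q.put(item)  # blocks while the buffer is full
        finally:
            q.put(sentinel)  # always signal completion, even on error

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def train(batches, step_time=0.01):
    """Consume batches; the sleep stands in for a forward/backward pass."""
    seen = []
    for batch in batches:
        time.sleep(step_time)
        seen.append(batch)
    return seen


if __name__ == "__main__":
    start = time.perf_counter()
    train(slow_batches(20))
    print(f"synchronous: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    train(prefetch(slow_batches(20)))
    print(f"prefetched:  {time.perf_counter() - start:.2f}s")
```

In PyTorch itself, the corresponding knob is the `num_workers` argument of `torch.utils.data.DataLoader`: a value of 0 loads data in the main process, while a positive value moves loading into worker processes.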

maximize mistral throughput

2

Instructlab backend currently focuses on mistral fine tuning and I'm trying to maximize throughput for that. If anyone notices anything obvious or has any suggestions I'd truly appreciate it. @raghukiran1224...

aldopareja

[speculator training] Speculator training

11

Add support for speculator training, piggybacking off the existing training utilities. Training script and speculator-specific utilities are inside the new `speculator` subfolder. Uses distributed setup, checkpointing, and dataloaders from this...

daviswer

speculator training

A revisit on improving the performance of Data Loader

2

We have been noticing a training slowdown introduced by our dataloader. Upon further checking, we traced the issue to the fact that our dataset class is...

lchu6

[speculator training] Update benchmark_speculator_logical.py to support gpt_bigcode/granite

9

@daviswer: should we also add the caller script?

sahilsuneja1

speculator training

[speculator training] Support for loading different HF checkpoints for speculator training

1

While training a speculator using the specu-train branch, I get an OOM error when trying to load a checkpoint in HuggingFace format. The model_type is "gpt_megatron". The script works fine for...

pavi2707

speculator training

A write-up on Meta Device Init x Pretraining

Scope: This write-up only applies to "initial model init". For cases that require loading a checkpoint (continue-pretraining, fine-tuning and inference), this is not needed as any init would be...

lchu6

revert "raise Dynamo accumulated cache size limit"

We recently added a commit to raise the Dynamo accumulated cache size limit so that compile works with large models like 70b, whose num_layer is greater than the default limit (64): https://github.com/foundation-model-stack/fms-fsdp/pull/45#issuecomment-2002564455....

lchu6

add FLOP counter

Add a FLOP counter to the code, controlled by a bool flag. It is already available in the flop_counter branch but will require some extra work to prettify it and integrate...

lchu6

enhancement

The model conversion to hf is broken with the latest Fused GatedLinearUnit Support in ibm-fms 0.0.6

1

The latest changes in 0.0.6 (https://github.com/foundation-model-stack/foundation-model-stack/commit/eccd6028cec75f84ce1834a3a18649f5d8fc0641) break the [model conversion code](https://github.com/foundation-model-stack/fms-fsdp/blob/starcoder/fms_to_hf.py). I am switching back to the foundation-model-stack commit d04def43e9eb8a4e0adf7285c59dd66274e1b724, which still works.

thinkahead
