Linsong Chu issues

Results: 10 issues of Linsong Chu

A revisit on improving the performance of Data Loader

We have noticed a training slowdown introduced by our dataloader. On further investigation, we traced the issue to the fact that our dataset class is...

A write-up on Meta Device Init x Pretraining

## Scope This write-up only applies to "initial model init". For cases that require loading a checkpoint (continue-pretraining, fine-tuning and inference), this is not needed as any init would be...

revert "raise Dynamo accumulated cache size limit"

We recently added a commit to raise the Dynamo accumulated cache size limit so that compile works with large models such as 70b, whose num_layer exceeds the default limit (64): https://github.com/foundation-model-stack/fms-fsdp/pull/45#issuecomment-2002564455....

add FLOP counter

Add a FLOP counter to the code behind a bool flag. It is already available in the flop_counter branch but will require some extra work to prettify it and integrate...

enhancement

coqa not working

This happened once before and got fixed: https://github.com/EleutherAI/lm-evaluation-harness/issues/898 But now it seems to be failing again with the same error, at least on my end. ```bash File "/home/lchu/.conda/envs/main/lib/python3.9/site-packages/datasets/builder.py", line 1726, in _prepare_split_single...

bug

[Checkpoint] Rewrite Checkpointing

### Background It is time to rewrite the current checkpointing for a few reasons: 1. Functionality support: move from FSDPv1 logic to DTensor logic, so we support DCP for FSDPv2, TP,...

[PP + EP][Master Thread] Enable Pipeline Parallelism (PP) and Expert Parallelism (EP)

## Background We are starting an effort to enable PP + EP for our training experiments on Mamba MoE. Compared to other forms of parallelism (FSDP, CP, TP), PP can be much...

[PP + EP][Stage I] PP x Mamba

## Working Items ### Modifications to Mamba code (https://github.com/lchu6/mamba/commit/fd4fa086012300bea759fc281733c16bb482e03d) - [x] **`nn.ModuleList -> nn.ModuleDict`.** Currently the Mamba blocks are stored in a `ModuleList`, which has a **dynamic FQN** that will cause...

Add support for Mamba-MoE

https://github.com/foundation-model-stack/fms-fsdp/issues/128

add Mamba-MoE training support

A detailed list of TODOs Mamba repo - [x] create a Mamba-MoE branch in Mamba repo @fabianlim FMS-FSDP repo - [x] add mamba moe configs - [x] modify loss to...