Linsong Chu

Results 10 issues of Linsong Chu

We have noticed a training slowdown introduced by our dataloader. On further investigation, we traced the issue to the fact that our dataset class is...

## Scope

This write-up only applies to "initial model init". For cases that require loading a checkpoint (continue-pretraining, fine-tuning and inference), this is not needed as any init would be...

We recently added a commit that raises Dynamo's accumulated cache size limit so that compile works with large models such as 70b, whose num_layer is greater than the default limit (64): https://github.com/foundation-model-stack/fms-fsdp/pull/45#issuecomment-2002564455....

Add a flop counter to the code, controlled by a bool flag. It is already available in the flop_counter branch but will require some extra work to prettify it and integrate...

enhancement

This happened once before and got fixed: https://github.com/EleutherAI/lm-evaluation-harness/issues/898. But it now seems to be broken again with the same error, at least on my end: `File "/home/lchu/.conda/envs/main/lib/python3.9/site-packages/datasets/builder.py", line 1726, in _prepare_split_single`...

bug

### Background

It is time to rewrite the current checkpointing for a few reasons:

1. Functionality support: move from FSDPv1 logic to DTensor logic, so we support DCP for FSDPv2, TP,...

## Background

We are starting an effort to enable PP + EP for our training experiments on Mamba MoE. Compared to other forms of parallelism (FSDP, CP, TP), PP can be much...

## Working Items

### Modifications on Mamba code (https://github.com/lchu6/mamba/commit/fd4fa086012300bea759fc281733c16bb482e03d)

- [x] **`nn.ModuleList -> nn.ModuleDict`.** Currently the Mamba blocks are stored in `ModuleList` which has a **dynamic FQN** that will cause...

https://github.com/foundation-model-stack/fms-fsdp/issues/128

A detailed list of TODOs

Mamba repo
- [x] create a Mamba-MoE branch in Mamba repo @fabianlim

FMS-FSDP repo
- [x] add mamba moe configs
- [x] modify loss to...