torchtitan issues

run sdpa with dtensor

4

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #180 * #285 * #161 * #172 This PR gets rid of the manual adjustment of the number of heads in attention layers,...

tianyu-l

CLA Signed

Checkpoint saves failing for eager mode training

7

There seems to be a tricky timeout issue during checkpoint saves. It fails on most runs for me, on multiple machines. Steps to reproduce: 1. git clone and install torchtrain...

chauhang

bug

wip pipelinestage

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #174 * #161 * #172

wconstab

CLA Signed

Grad scaler not in train state

2

Grad scaler factor needs to be saved in train state for proper reloading.

BadrYoubiIdrissi

Loss curve spikes on amalgamated datasets - need full scale shuffler in dataloader

5

As part of e2e training, we encountered wild loss curve spikes. After additional hyperparameter tuning and further investigation, the root cause is that we are reading the dataset sequentially, so to...

lessw2020

enhancement

metrics - add L1 gradient norm tracking

to help monitor training stability.

lessw2020

enhancement

integrate with nccl-exp

1

to enable things like zero-copy

wanchaol

enhancement

consider - enable streaming attention as default for llama models (1-4M context)

For the price of 4 additional tokens (the first four), we can enable streaming window attention and extremely long context lengths (1-4M?). "We introduce StreamingLLM, an efficient framework that enables...

lessw2020

enhancement

FSDP2 based HSDP support

Add HSDP + TP/SP support
