torchtitan

A PyTorch native library for large-scale model training

Results: 270 torchtitan issues, sorted by recently updated.

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #180
* #285
* #161
* #172

This PR gets rid of the manual adjustment of the number of heads in attention layers, ...

CLA Signed
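For context on the PR above, here is a minimal, hypothetical sketch of the kind of manual head-count adjustment being removed; the names below are illustrative, not torchtitan's actual model code. Under tensor parallelism, each rank holds only a shard of the attention heads, so model code would divide `n_heads` by the TP degree by hand.

```python
# Hypothetical sketch of the manual head-count adjustment this PR removes:
# under tensor parallelism each rank only holds a shard of the attention heads,
# so model code used to divide n_heads by the TP degree by hand.
from dataclasses import dataclass


@dataclass
class ModelArgs:
    n_heads: int = 32
    dim: int = 4096


def local_attention_dims(args: ModelArgs, tp_degree: int) -> tuple[int, int]:
    """Return (local_heads, head_dim) for one tensor-parallel rank."""
    assert args.n_heads % tp_degree == 0, "heads must divide evenly across TP ranks"
    local_heads = args.n_heads // tp_degree  # the manual adjustment in question
    head_dim = args.dim // args.n_heads
    return local_heads, head_dim


print(local_attention_dims(ModelArgs(), tp_degree=8))  # (4, 128)
```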

There seems to be a tricky timeout issue during checkpoint saves. It is failing for most runs for me on multiple machines.

### Steps to reproduce:
1. git clone and install torchtrain ...

bug
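Not a root-cause fix for the issue above, but one possible workaround for collective timeouts during long checkpoint saves is to give the process group more headroom at init; a minimal sketch, assuming NCCL and the usual torchrun environment variables:

```python
# Workaround sketch only: initialize the default process group with a longer
# timeout so long-running checkpoint saves do not trip the collective watchdog.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # the default can be too short for large saves
)

# ... training and checkpoint saving run here ...

dist.destroy_process_group()
```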

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #174
* #161
* #172

CLA Signed

The grad scaler's scale factor needs to be saved in the train state for proper reloading.
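A minimal sketch of what that could look like, using the standard `torch.cuda.amp.GradScaler` state_dict APIs; the `train_state` dict and file name here are placeholders, not torchtitan's actual TrainState:

```python
# Sketch: GradScaler exposes state_dict()/load_state_dict(), so its scale factor
# can ride along in the training checkpoint next to model/optimizer state.
import torch

scaler = torch.cuda.amp.GradScaler()

# Saving: include the scaler state in the train state that gets checkpointed.
train_state = {"step": 1000, "scaler": scaler.state_dict()}
torch.save(train_state, "train_state.pt")

# Reloading: restore the scale factor so loss scaling resumes where it left off.
loaded = torch.load("train_state.pt")
scaler.load_state_dict(loaded["scaler"])
```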

As part of e2e training, we encountered wild loss curve spikes. After additional hyperparameter tuning and further investigation, the root cause is that we are reading the dataset sequentially, so to ...

enhancement
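One common remedy for the issue above is a shuffle buffer over the streaming dataset; the sketch below is illustrative (buffer size and input are placeholders), not torchtitan's actual data loader:

```python
# Sketch: a shuffle buffer yields samples in a locally shuffled order without
# materializing the whole dataset, breaking up long sequential runs of data.
import random
from typing import Iterable, Iterator


def buffered_shuffle(samples: Iterable, buffer_size: int = 10_000, seed: int = 0) -> Iterator:
    """Yield samples in a locally shuffled order from a streaming source."""
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)
    yield from buffer


# Example: sequential ints come out locally shuffled.
print(list(buffered_shuffle(range(20), buffer_size=5)))
```

A buffer that is large relative to the degree of local ordering in the data is usually enough to break up long runs of similar samples.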

... to help monitor training stability.

enhancement

... to enable things like zero-copy.

enhancement

For the price of 4 additional tokens (the first four), we can enable streaming window attention and extremely long context lengths (1-4M?). "We introduce StreamingLLM, an efficient framework that enables ...

enhancement
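A minimal sketch of the attention pattern described in the quote above (a handful of "sink" tokens plus a sliding window), shown here only as a boolean mask; StreamingLLM's actual KV-cache eviction and positional handling are not modeled, and the sizes are placeholders:

```python
# Sketch: boolean causal mask where each query attends to the first n_sink
# tokens plus the most recent `window` tokens.
import torch


def sink_window_mask(seq_len: int, n_sink: int = 4, window: int = 256) -> torch.Tensor:
    """Build a (seq_len, seq_len) boolean mask: True means the query may attend."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions
    k = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = k <= q
    in_window = (q - k) < window
    is_sink = k < n_sink
    return causal & (in_window | is_sink)


mask = sink_window_mask(seq_len=8, n_sink=2, window=3)
print(mask.int())
```

Such a mask can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where `True` marks positions a query is allowed to attend to.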