torchtitan
A PyTorch native library for large-scale model training
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #180
* #285
* #161
* #172

This PR gets rid of the manual adjustment of num of heads in attention layers,...
There seems to be a tricky timeout issue during checkpoint saves. It fails for most runs for me on multiple machines.

### Steps to reproduce:
1. git clone and install torchtrain...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #174
* #161
* #172
Grad scaler factor needs to be saved in train state for proper reloading.
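To illustrate why the scale factor belongs in the checkpoint, here is a minimal, framework-free sketch. The `TrainState` class, its field names, and `update_scale` are hypothetical and are not torchtitan's API; in PyTorch the equivalent values would come from the gradient scaler's own `state_dict()`. The point is only that a resumed run should start from the saved scale rather than the default.

```python
from dataclasses import dataclass, asdict


@dataclass
class TrainState:
    """Hypothetical training state; field names are illustrative only."""
    step: int = 0
    loss_scale: float = 65536.0  # current grad scaler factor
    growth_tracker: int = 0      # consecutive steps without overflow

    def state_dict(self) -> dict:
        # Everything needed to resume, including the scaler factor.
        return asdict(self)

    def load_state_dict(self, state: dict) -> None:
        self.step = int(state["step"])
        self.loss_scale = float(state["loss_scale"])
        self.growth_tracker = int(state["growth_tracker"])


def update_scale(state: TrainState, found_inf: bool,
                 growth_interval: int = 2000,
                 backoff: float = 0.5, growth: float = 2.0) -> None:
    """Dynamic loss scaling: back off on overflow, grow after a clean streak."""
    state.step += 1
    if found_inf:
        state.loss_scale *= backoff
        state.growth_tracker = 0
    else:
        state.growth_tracker += 1
        if state.growth_tracker >= growth_interval:
            state.loss_scale *= growth
            state.growth_tracker = 0
```

If the scale were left out of the saved state, a reloaded run would restart from the default factor and could overflow again until the scaler re-converges.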
As part of e2e training, we encountered wild loss curve spikes. After additional hyperparameter tuning and further investigation, the root cause is that we are reading the dataset sequentially, so to...
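One common way to break up strictly sequential reads from a streaming dataset is a bounded shuffle buffer. The sketch below is an assumption about a possible mitigation, not necessarily the fix this issue goes on to describe; `shuffle_buffer` and its parameters are hypothetical names.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items from `stream` in a locally randomized order.

    Keeps at most `buffer_size` items in memory, so it works on
    datasets that are too large to shuffle in full.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: collect items until the buffer is full.
            buffer.append(item)
            continue
        # Steady state: emit a random buffered item, replace it with the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer
```

A larger `buffer_size` mixes samples from farther apart in the stream at the cost of more memory; the seed would also need to be tracked if reads must be reproducible across restarts.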
For the price of four additional tokens (the first four), we can enable streaming window attention and extremely long context lengths (1-4M?). "we introduce StreamingLLM, an efficient framework that enables...
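The cache policy this describes (always keep the first four tokens, plus a sliding window of the most recent ones) can be sketched as follows. This is a minimal illustration that tracks token positions only, not real key/value tensors; the class name `SinkWindowCache` and the default window size are assumptions, not taken from the paper or from torchtitan.

```python
from collections import deque
from typing import Deque, List


class SinkWindowCache:
    """Tracks which token positions remain visible to attention."""

    def __init__(self, num_sinks: int = 4, window: int = 8) -> None:
        self.num_sinks = num_sinks
        self.sinks: List[int] = []                      # first tokens, never evicted
        self.recent: Deque[int] = deque(maxlen=window)  # sliding window

    def append(self, position: int) -> None:
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(position)
        else:
            # deque(maxlen=...) drops the oldest entry automatically.
            self.recent.append(position)

    def visible(self) -> List[int]:
        """Positions the next token may attend to."""
        return self.sinks + list(self.recent)


cache = SinkWindowCache(num_sinks=4, window=8)
for pos in range(100):
    cache.append(pos)
print(cache.visible())  # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

The cache size stays fixed at `num_sinks + window` entries no matter how long the stream runs, which is what keeps memory bounded for very long contexts.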