torchtitan
A PyTorch native library for large-scale model training
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #301 * #300 * #161 Get sm_count another way to work around issues with meta-device tracing Note: this PR isn't strictly safe...
Per user request: we don't currently have any info on how to do this (basically, extend the hf_dataset file).
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #340 * #337 * __->__ #318 runs PP+DP and PP+TP without issue, runs PP+TP+DP with decreasing loss, but fails DCP save Supports only...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #318 * __->__ #322 * #321 A few small changes here let the manual PP frontend 'reconfigure' a whole transformer model to a stage's...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #318 * #322 * __->__ #321 Unchanged: we precompute freqs_cis for max_seqlen, >> seqlen for a given batch. Changed: instead of slicing self.freqs_cis...
This PR adds the option to selectively compile only the norm layers, and is mainly targeted at RMSNorm. By compiling just the norm layers when using rmsnorm, we get...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #319
I noticed that there are two parts of the implementation that are related to model initialization. ### Instancing the model with meta tensor https://github.com/pytorch/torchtitan/blob/f72a2a0da0bdfc394faaab9b3c0f35d0b6f5be50/train.py#L177-L181 ### Doing explicit model initialization https://github.com/pytorch/torchtitan/blob/f72a2a0da0bdfc394faaab9b3c0f35d0b6f5be50/train.py#L209-L210 The...