torchtitan

A PyTorch native library for large-scale model training

Results 270 torchtitan issues

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #301 * #300 * #161 Get sm_count another way to work around issues with meta-device tracing. Note: this PR isn't strictly safe...

CLA Signed

Per user request: we don't currently have any info on how to do this (basically, extend the hf_dataset file).

documentation
enhancement

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #340 * #337 * __->__ #318 Runs PP+DP and PP+TP without issue, runs PP+TP+DP with decreasing loss, but fails DCP save. Supports only...

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #318 * __->__ #322 * #321 A few small changes here let the manual PP frontend 'reconfigure' a whole transformer model to a stage's...

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #318 * #322 * __->__ #321 Unchanged: we precompute freqs_cis for max_seqlen, which is much larger than the seqlen of a given batch. Changed: instead of slicing self.freqs_cis...

CLA Signed

This PR adds the option to selectively compile only the norm layers, and is mainly targeted at RMSNorm. By compiling just the norm layers when using RMSNorm, we get...

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #319

CLA Signed

I noticed that there are two parts of the implementation that are related to model initialization. ### Instantiating the model with meta tensors https://github.com/pytorch/torchtitan/blob/f72a2a0da0bdfc394faaab9b3c0f35d0b6f5be50/train.py#L177-L181 ### Doing explicit model initialization https://github.com/pytorch/torchtitan/blob/f72a2a0da0bdfc394faaab9b3c0f35d0b6f5be50/train.py#L209-L210 The...

question