Rohan Varma
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #83055
* #83035
* #82892

NamedTuple support is blocking MultiModal adoption. TODO: add test.
### 🐛 Describe the bug

Sometimes, `_post_backward_hook` will not fire if gradients were not accumulated on the FSDP-managed parameter, such as if all parameters in an FSDP module were...
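Below is a minimal repro sketch of that scenario, assuming a single-rank `gloo` group on CPU and a nested FSDP unit whose output is discarded; the module and variable names are illustrative, not taken from the original report.

```python
# Hypothetical repro sketch (assumption: a single-process "gloo" group on CPU is
# enough to exercise the post-backward path); names are illustrative.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(8, 8)
        # Every parameter in this inner FSDP unit ends up with no gradient.
        self.unused = FSDP(nn.Linear(8, 8))

    def forward(self, x):
        self.unused(x)  # output discarded, so no gradient reaches `unused`
        return self.used(x)


model = FSDP(Model())
model(torch.randn(4, 8)).sum().backward()
# Expectation per the report: post-backward handling for `unused`'s parameters
# may not run, since its gradients were never accumulated.
dist.destroy_process_group()
```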
i.e. different clients can train different models
### 🐛 Describe the bug

```
class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(10, 10)
        self.b = nn.Linear(10, 10)

    def forward(self, x):
        a = self.a(x)
        b = self.b(x)
        return (a,...
```
### 🚀 The feature

Add tests to ensure the right sharded grad scaler, no_sync ctx manager, etc. are picked up when using composable FSDP (a rough sketch follows below).

### Motivation, pitch

....
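A sketch of what such a test could look like, covering only the grad-scaler part of the idea: `get_grad_scaler` here is a hypothetical helper standing in for whatever selection logic the composable-FSDP path actually uses, so the real utility and its signature may differ.

```python
# Sketch only: `get_grad_scaler` is a hypothetical stand-in for the real selection
# logic; the point is just to pin down which scaler each mode gets.
import torch
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler


def get_grad_scaler(use_fsdp: bool) -> torch.cuda.amp.GradScaler:
    # Hypothetical selection logic: sharded scaler under FSDP, plain scaler otherwise.
    return ShardedGradScaler() if use_fsdp else torch.cuda.amp.GradScaler()


def test_grad_scaler_selection() -> None:
    assert isinstance(get_grad_scaler(use_fsdp=True), ShardedGradScaler)
    assert not isinstance(get_grad_scaler(use_fsdp=False), ShardedGradScaler)
```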
The API enforces that the wrapping policy be just a set of modules, which is sufficient for a few use cases, but the underlying API offers more generality in terms...
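For illustration, a minimal sketch of the difference, assuming PyTorch's FSDP `auto_wrap_policy` protocol: a fixed set of module classes via `ModuleWrapPolicy` versus an arbitrary callable policy (the size threshold and module class below are made up for the example).

```python
# Minimal sketch assuming PyTorch's FSDP auto_wrap_policy protocol; the threshold
# and module class are illustrative only.
import torch.nn as nn
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

# Set-of-module-classes style (what the current API exposes):
set_policy = ModuleWrapPolicy({nn.TransformerEncoderLayer})


# More general callable style accepted by the underlying FSDP API:
def size_based_policy(module: nn.Module, recurse: bool, nonwrapped_numel: int) -> bool:
    if recurse:
        return True  # always keep traversing children
    return nonwrapped_numel >= 1_000_000  # wrap only sufficiently large submodules


# model = FSDP(model, auto_wrap_policy=set_policy)          # set-of-modules
# model = FSDP(model, auto_wrap_policy=size_based_policy)   # arbitrary callable
```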
https://github.com/pytorch/torchtune/pull/779 is adding QLoRA-13B, but we need to add CI for this as well.
This will save memory for GQA / MQA, but will require a bit of refactoring of the attention forward pass.
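A rough sketch of where the memory goes, with assumed shapes and names (not the actual attention code); the `enable_gqa` path requires PyTorch >= 2.5 and is shown only as one possible way to avoid the expansion.

```python
# Illustrative sketch with assumed shapes/names. A naive GQA/MQA forward expands
# K/V from num_kv_heads to num_heads before attention; keeping K/V un-expanded is
# where the memory saving comes from.
import torch
import torch.nn.functional as F

bsz, seq_len, num_heads, num_kv_heads, head_dim = 2, 128, 32, 8, 64
q = torch.randn(bsz, num_heads, seq_len, head_dim)
k = torch.randn(bsz, num_kv_heads, seq_len, head_dim)
v = torch.randn(bsz, num_kv_heads, seq_len, head_dim)

# Naive path: materialize num_heads // num_kv_heads copies of K and V.
k_exp = k.repeat_interleave(num_heads // num_kv_heads, dim=1)
v_exp = v.repeat_interleave(num_heads // num_kv_heads, dim=1)
out_naive = F.scaled_dot_product_attention(q, k_exp, v_exp)

# Memory-saving path: K/V stay at num_kv_heads (PyTorch >= 2.5).
out_gqa = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
torch.testing.assert_close(out_naive, out_gqa)
```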
#### Context

- In this PR, we introduce `TunePerfMonitor`, a utility class for tracking metrics across training. This class is meant to be flexible in the actual metrics that users...
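As a rough illustration of that flexibility (a hypothetical stand-in, not the actual `TunePerfMonitor` API from the PR), a metrics container that stays agnostic to which metrics callers record:

```python
# Hypothetical sketch only; the real TunePerfMonitor API may differ.
from collections import defaultdict
from typing import Dict, List


class PerfMonitorSketch:
    """Accumulate arbitrary named metrics across training steps."""

    def __init__(self) -> None:
        self._metrics: Dict[str, List[float]] = defaultdict(list)

    def update(self, **metrics: float) -> None:
        # Callers choose the metrics, e.g. tokens_per_second, peak_memory_gb.
        for name, value in metrics.items():
            self._metrics[name].append(value)

    def averages(self) -> Dict[str, float]:
        return {name: sum(vals) / len(vals) for name, vals in self._metrics.items()}


# monitor = PerfMonitorSketch()
# monitor.update(tokens_per_second=1200.0, step_time_s=0.8)
# print(monitor.averages())
```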