longseq sequence_parallel param update issue
Hi,
In the code referenced below: https://github.com/NUS-HPC-AI-Lab/OpenDiT/blob/c15d82b738d0efb7f8f9e79c2f5277cbb417c8e2/opendit/modules/attn.py#L93-L97
only part of self.qkv.weight is involved in the forward pass, and therefore only part of the weight is updated by the optimizer.
But how do you make sure that the full self.qkv.weight is correctly updated and saved?
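To clarify what I mean, here is a small diagnostic I would run after `loss.backward()` to see which rows of `qkv.weight` actually receive a gradient on each rank (the `attn_module` handle and the check itself are just my own illustration, not OpenDiT's API):

```python
import torch.distributed as dist

def report_qkv_grad_coverage(attn_module):
    """Print how many rows of qkv.weight received a non-zero grad on this rank."""
    grad = attn_module.qkv.weight.grad
    if grad is None:
        print(f"rank {dist.get_rank()}: qkv.weight.grad is None")
        return
    # A row counts as "touched" if any entry in it has a non-zero gradient.
    touched = (grad.abs().sum(dim=1) > 0).sum().item()
    print(f"rank {dist.get_rank()}: {touched}/{grad.shape[0]} rows of qkv.weight have grad")
```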
Our parallelism strategy makes sure the gradients are synced between GPUs.
For example, with sequence parallel size = 2 and 2 GPUs, qkv.weight is sliced into 2 slices.
On rank 0, the first half of qkv.weight has a grad and the second half does not; on rank 1, the first half has no grad and the second half does. Is this correct?
If so, the grads of qkv.weight should not simply be all-reduced between the two GPUs, since they are not data parallel. How does the parallelism strategy handle this case?
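To make the scenario concrete, here is a standalone toy sketch (my own illustration, not OpenDiT code) of what I believe happens to the grads of a replicated weight when each rank only uses its own slice in the forward pass:

```python
# Toy sketch of the scenario above. Launch with:
#   torchrun --nproc_per_node=2 toy_sp_grad.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # The full qkv-like weight is replicated on every rank.
    torch.manual_seed(0)
    weight = torch.randn(4, 4, requires_grad=True)

    # Each rank only uses its own slice of rows in the forward pass,
    # which is how I read the linked qkv slicing.
    rows_per_rank = weight.shape[0] // world_size
    start = rank * rows_per_rank
    my_slice = weight[start : start + rows_per_rank]

    x = torch.randn(2, 4)
    (x @ my_slice.t()).sum().backward()

    # Expectation from my question: only the rows used on this rank get a
    # non-zero grad; the rows used on the other rank stay at zero.
    print(f"rank {rank} grad row norms: {weight.grad.norm(dim=1)}")

    # Open question: is an all-reduce (sum) over these grads what the
    # strategy does (the untouched rows are zero anyway), or is there an
    # explicit gather of the per-rank slices before the optimizer step?
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```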
same question