longseq sequence_parallel param update issue
Hi,
In the code referenced below: https://github.com/NUS-HPC-AI-Lab/OpenDiT/blob/c15d82b738d0efb7f8f9e79c2f5277cbb417c8e2/opendit/modules/attn.py#L93-L97
only part of self.qkv.weight is involved in the forward pass, and therefore only part of the weight is updated by the optimizer.
But how do you make sure that the full self.qkv.weight is correctly updated and saved?
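To clarify what I mean, here is a small diagnostic I would run after `loss.backward()` to see which rows of `qkv.weight` actually receive a gradient on each rank (the `attn_module` handle and the check itself are just my own illustration, not OpenDiT's API):

```python
import torch.distributed as dist

def report_qkv_grad_coverage(attn_module):
    """Print how many rows of qkv.weight received a non-zero grad on this rank."""
    grad = attn_module.qkv.weight.grad
    if grad is None:
        print(f"rank {dist.get_rank()}: qkv.weight.grad is None")
        return
    # A row counts as "touched" if any entry in it has a non-zero gradient.
    touched = (grad.abs().sum(dim=1) > 0).sum().item()
    print(f"rank {dist.get_rank()}: {touched}/{grad.shape[0]} rows of qkv.weight have grad")
```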
Our parallelism strategy makes sure the gradients are synced between GPUs.
For example, with sequence parallel size = 2 and 2 GPUs, qkv.weight is sliced into 2 slices.
On rank 0, the first half of qkv.weight has a grad and the second half does not; on rank 1, the first half has no grad and the second half does. Is this correct?
If so, the grads of qkv.weight should not simply be all-reduced between the two GPUs, since they are not data parallel. How does the parallelism strategy handle this case?
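To make the scenario concrete, here is a standalone toy sketch (my own illustration, not OpenDiT code) of what I believe happens to the grads of a replicated weight when each rank only uses its own slice in the forward pass:

```python
# Toy sketch of the scenario above. Launch with:
#   torchrun --nproc_per_node=2 toy_sp_grad.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # The full qkv-like weight is replicated on every rank.
    torch.manual_seed(0)
    weight = torch.randn(4, 4, requires_grad=True)

    # Each rank only uses its own slice of rows in the forward pass,
    # which is how I read the linked qkv slicing.
    rows_per_rank = weight.shape[0] // world_size
    start = rank * rows_per_rank
    my_slice = weight[start : start + rows_per_rank]

    x = torch.randn(2, 4)
    (x @ my_slice.t()).sum().backward()

    # Expectation from my question: only the rows used on this rank get a
    # non-zero grad; the rows used on the other rank stay at zero.
    print(f"rank {rank} grad row norms: {weight.grad.norm(dim=1)}")

    # Open question: is an all-reduce (sum) over these grads what the
    # strategy does (the untouched rows are zero anyway), or is there an
    # explicit gather of the per-rank slices before the optimizer step?
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```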
same question