longseq sequence_parallel param update issue

Open KimmiShi opened this issue 1 year ago • 3 comments

Hi,

In the code referenced below: https://github.com/NUS-HPC-AI-Lab/OpenDiT/blob/c15d82b738d0efb7f8f9e79c2f5277cbb417c8e2/opendit/modules/attn.py#L93-L97

Only part of self.qkv.weight is involved in the forward pass, and therefore only that part of the weight is updated by the optimizer.

But how do you make sure that the full self.qkv.weight is correctly updated and saved?
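To make the concern concrete, here is a minimal sketch in plain Python (no PyTorch) of why a row-sliced weight only receives gradient on the rows that took part in the forward pass. The function name, shapes, and values are hypothetical and chosen for illustration; this is not the OpenDiT code.

```python
# Toy illustration, NOT the OpenDiT implementation.
# Forward (per rank):  y = x @ W[row_start:row_end].T
# Backward:            dL/dW[r][c] = sum_n grad_out[n][r - row_start] * x[n][c]
# Rows of W outside the slice never appear in the forward pass, so their
# gradient stays zero.

def sliced_linear_weight_grad(x, grad_out, num_rows, row_start, row_end):
    """Gradient w.r.t. the FULL weight when only rows [row_start, row_end) are used."""
    in_features = len(x[0])
    grad_w = [[0.0] * in_features for _ in range(num_rows)]
    for local_r, r in enumerate(range(row_start, row_end)):
        for c in range(in_features):
            grad_w[r][c] = sum(grad_out[n][local_r] * x[n][c] for n in range(len(x)))
    return grad_w

# Hypothetical sizes: 2 tokens, 2 input features, a weight with 4 output rows.
x = [[1.0, 2.0], [3.0, 4.0]]
grad_out = [[1.0, 1.0], [1.0, 1.0]]  # upstream gradient for the 2 rows in the slice

# This rank only uses rows 0 and 1 of the 4-row weight.
grad_w = sliced_linear_weight_grad(x, grad_out, num_rows=4, row_start=0, row_end=2)
for row in grad_w:
    print(row)
```

Rows 2 and 3 come out as zeros here, which is the "only part of the weight is updated" situation described above.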

KimmiShi avatar Mar 05 '24 07:03 KimmiShi

Our parallelism strategy makes sure the gradients are synced between GPUs.

oahzxl avatar Mar 05 '24 09:03 oahzxl

For example, with sequence parallel size = 2 and 2 GPUs, qkv.weight is sliced into 2 slices. On rank 0, the first half of qkv.weight has a gradient and the second half does not; on rank 1, the first half has no gradient and the second half does. Is this correct?

If so, the grads of qkv.weight should not be directly all-reduced (averaged) between the two GPUs, since the two ranks are not data-parallel replicas. How does the parallelism strategy handle this case?
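A small simulation of the scenario described here, assuming each rank's gradient is nonzero only on its own slice. The `all_reduce` helper is a hypothetical stand-in for a collective, written in plain Python. Which reduction OpenDiT actually applies is not stated in this thread.

```python
# Simulated all-reduce over per-rank gradient vectors (illustration only).

def all_reduce(per_rank_grads, op="sum"):
    """Combine the gradients element-wise across ranks with a sum or a mean."""
    world_size = len(per_rank_grads)
    reduced = [sum(values) for values in zip(*per_rank_grads)]
    if op == "mean":
        reduced = [v / world_size for v in reduced]
    return reduced

# Hypothetical flattened gradients of a 4-element weight.
rank0_grad = [0.5, -1.0, 0.0, 0.0]   # only the first half has grad on rank 0
rank1_grad = [0.0, 0.0, 2.0, 0.25]   # only the second half has grad on rank 1

summed = all_reduce([rank0_grad, rank1_grad], op="sum")
averaged = all_reduce([rank0_grad, rank1_grad], op="mean")

print(summed)    # each slice keeps its own gradient
print(averaged)  # every entry is scaled by 1 / world_size
```

Under this assumption, a sum reproduces each slice's gradient unchanged, while a DP-style average halves every entry, which is why a plain averaged all-reduce looks wrong for this case.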

KimmiShi avatar Mar 05 '24 09:03 KimmiShi

same question

yhy-2000 avatar Mar 08 '24 01:03 yhy-2000