
Questions about longseq

Open yhy-2000 opened this issue 1 year ago • 3 comments

Hi, thanks for your great work.

I've conducted tests on a machine with two A40 GPUs, executing the following command:

torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 3 \
    --num_classes 10

I observed that each epoch takes approximately 1.5 hours to complete, as indicated by: Epoch 0: 3%|▎ | 284/8333 [03:15<1:30:18, 1.49it/s, loss=0.173, step=283, global_step=283].

However, after enabling all the acceleration techniques described in the readme with the command:

torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 3 \
    --num_classes 10 \
    --sequence_parallel_type longseq \
    --sequence_parallel_size 2 \
    --enable_modulate_kernel \
    --enable_flashattn \
    --enable_layernorm_kernel

the duration for each epoch doubled to 3 hours: Epoch 0: 0%| | 34/16666 [06:33<3:17:52, 1.40it/s, loss=0.943, step=33, global_step=33].

Could you please explain the reason behind this increased processing time?

yhy-2000 avatar Mar 08 '24 09:03 yhy-2000

Do not use sequence parallelism unless it is necessary, and increase the batch size as much as you can. You can follow the instructions in the readme for more details.

oahzxl avatar Mar 08 '24 09:03 oahzxl
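For illustration, a minimal sketch of what that advice implies on the same two-GPU setup: drop the sequence-parallel flags and raise --batch_size. The value 8 below is only a placeholder, not a recommendation from the thread; use the largest batch size that fits in A40 memory.

torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 8 \
    --num_classes 10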

Thank you for the guidance. However, I'm curious about the '80% speedup' mentioned in the readme. Could you clarify how to achieve this performance improvement?

yhy-2000 avatar Mar 08 '24 11:03 yhy-2000

Enable all kernels (except the modulate kernel, because it currently has an accuracy problem) and use as large a batch size as you can.

oahzxl avatar Mar 08 '24 11:03 oahzxl
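Putting the two replies together, a hedged example of the resulting command: sequence parallelism and the modulate kernel are dropped, the remaining kernels from the earlier command stay enabled, and the batch size (again, 8 is only a placeholder) is pushed as high as memory allows.

torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 8 \
    --num_classes 10 \
    --enable_flashattn \
    --enable_layernorm_kernel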