TransformerEngine
No improvement from TP/SP overlap
I ran the te_comm_gemm_overlap.py example with the backward-pass code removed, so only the forward pass runs.
Compared to TP with all-reduce, TP/SP with all-gather + reduce-scatter is slower.
My setup is 8 H20 GPUs.
Are there any other arguments that need to be changed?
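
For context on the comparison above, here is a back-of-envelope sketch of per-GPU communication volume. It assumes idealized ring collectives and ignores latency and kernel launch overhead; the function name and the 64 MiB message size are hypothetical and not taken from the example script. Under these assumptions, all-gather plus reduce-scatter moves the same number of bytes as one all-reduce, so any gain would have to come from hiding communication behind the GEMM rather than from sending less data.

```python
def ring_bytes_per_gpu(message_bytes: int, world_size: int) -> dict:
    """Bytes each GPU sends under an idealized ring algorithm (assumption).

    message_bytes is the size of the full (unsharded) tensor.
    """
    # Each ring pass sends (N - 1) chunks of size message_bytes / N.
    one_pass = message_bytes * (world_size - 1) / world_size
    return {
        "all_gather": one_pass,       # one pass around the ring
        "reduce_scatter": one_pass,   # one pass around the ring
        "all_reduce": 2 * one_pass,   # reduce-scatter followed by all-gather
    }


# Hypothetical sizes: a 64 MiB activation tensor across 8 GPUs.
vol = ring_bytes_per_gpu(message_bytes=64 * 1024 * 1024, world_size=8)
for name, nbytes in vol.items():
    print(f"{name}: {nbytes / 2**20:.1f} MiB per GPU")
```

If that reasoning holds, the slowdown would point at the overlap not actually taking effect in the forward pass, which is what the question about arguments is trying to pin down.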