
No improvement from TP/SP overlap

Open artetaout opened this issue 11 months ago • 0 comments

I ran the te_comm_gemm_overlap.py example with the backward code removed, so only the forward pass runs.

Compared to TP with all-reduce, TP/SP with all-gather + reduce-scatter is slower.

My setup is 8 H20 GPUs.
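
For context, here is a rough back-of-envelope sketch of the per-GPU communication volume in the two setups. It assumes idealized ring collectives and ignores latency and kernel launch overhead; the function name and the tensor shape are hypothetical and do not come from the example script. Under those assumptions, all-gather + reduce-scatter moves the same number of bytes as one all-reduce, so any speedup would have to come from overlapping communication with the GEMM rather than from less traffic.

```python
def ring_bytes_per_gpu(num_gpus: int, tensor_bytes: int) -> dict:
    """Bytes each GPU sends per collective under an idealized ring algorithm."""
    # One all-gather or reduce-scatter pass sends (N - 1) / N of the full tensor.
    step = (num_gpus - 1) * tensor_bytes / num_gpus
    # A ring all-reduce is a reduce-scatter pass followed by an all-gather pass.
    return {"all_gather": step, "reduce_scatter": step, "all_reduce": 2 * step}


# Hypothetical activation: 4096 tokens x 8192 hidden size in bf16 (2 bytes each).
vol = ring_bytes_per_gpu(num_gpus=8, tensor_bytes=4096 * 8192 * 2)

tp_only = vol["all_reduce"]                         # plain TP: one all-reduce
tp_sp = vol["all_gather"] + vol["reduce_scatter"]   # TP/SP: all-gather + reduce-scatter

print(f"TP all-reduce:    {tp_only / 2**20:.1f} MiB per GPU")
print(f"TP/SP AG + RS:    {tp_sp / 2**20:.1f} MiB per GPU")
```

This only compares traffic volume. It says nothing about how well the overlap itself works on a given interconnect.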

Are there any other arguments that need to be changed?

artetaout · Mar 18 '25 16:03