What's the optimal parallel strategy using TensorRT-LLM?

Open iteratorlee opened this issue 1 year ago • 2 comments

Thanks for your great work. I read the PR you opened in the TensorRT-LLM repo and noticed that EP + TP, PP + TP, and TP are supported during inference. Which one is optimal? Specifically, for the MoE layer, does EP or TP yield better performance?

Mar 28 '24 09:03 iteratorlee

cc: @megha95

Mar 28 '24 13:03 hanlint

TP is better, since at lower batch sizes it allows better load balancing. At higher batch sizes, the two should perform similarly. We haven't benchmarked EP yet.

Mar 28 '24 17:03 dskhudia
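
The load-balance argument in the reply above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a benchmark and not how TensorRT-LLM schedules work: routing is assumed to be uniformly random, the expert count, top-k, and GPU count are illustrative defaults, and `ep_imbalance` and `tp_imbalance` are hypothetical helper names. Under EP each GPU owns a subset of experts, so its load depends on which experts the tokens pick; under TP every expert is sharded across all GPUs, so each GPU does an equal share of every expert's work.

```python
import random


def ep_imbalance(batch_tokens, num_experts=16, top_k=4, num_gpus=8, seed=0):
    """Max/mean per-GPU load under expert parallelism (toy model).

    Assumes uniformly random top-k routing, which real routers do not
    guarantee; all default sizes are illustrative.
    """
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus
    load = [0] * num_gpus
    for _ in range(batch_tokens):
        # Each token is sent to top_k distinct experts.
        for expert in rng.sample(range(num_experts), top_k):
            # The GPU that owns this expert does all of its work.
            load[expert // experts_per_gpu] += 1
    mean = sum(load) / num_gpus
    return max(load) / mean


def tp_imbalance(batch_tokens, num_gpus=8):
    """Max/mean per-GPU load under tensor parallelism (toy model).

    Every expert's weights are split evenly across all GPUs, so each GPU
    performs the same fraction of the work regardless of routing.
    """
    return 1.0


if __name__ == "__main__":
    for batch in (1, 8, 64, 4096):
        print(batch, round(ep_imbalance(batch), 3), tp_imbalance(batch))
```

In this model a single token activates only four experts, so at least half of the eight GPUs sit idle under EP, while the EP ratio drifts toward 1.0 as the batch grows. That is consistent with the claim that the two should be similar at higher batch sizes, though the sketch ignores communication cost and kernel efficiency entirely.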