
What's the optimal parallel strategy using TensorRT-LLM?

Open iteratorlee opened this issue 1 year ago • 2 comments

First, thanks for your great work. I read the PR you opened in the TensorRT-LLM repo and noticed that EP + TP, PP + TP, and TP are supported during inference. Which one is optimal? Specifically, for the MoE layer, does EP or TP yield better performance?

iteratorlee, Mar 28 '24 09:03

cc: @megha95

hanlint, Mar 28 '24 13:03

TP is the better choice: at lower batch sizes it gives better load balance across GPUs. At higher batch sizes, TP and EP should perform similarly. We haven't benchmarked EP yet.

dskhudia, Mar 28 '24 17:03
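
To make the load-balance argument concrete, here is a rough toy model in Python, not a benchmark and not TensorRT-LLM code. It assumes 16 experts with top-4 routing (DBRX's MoE configuration), a hypothetical 8-GPU setup, uniformly random routing, and work measured as token-to-expert assignments. Under EP each GPU owns whole experts, so its load depends on which experts the router picks; under TP every expert is sharded across all GPUs, so each GPU gets an equal slice of every assignment. The names `per_gpu_load` and `imbalance` are illustrative only, and communication costs are ignored.

```python
import random


def per_gpu_load(batch_tokens, num_experts=16, top_k=4, num_gpus=8, seed=0):
    """Return (ep_load, tp_load): work per GPU under EP and under TP.

    Assumption: routing is uniformly random, which real routers are not.
    """
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus

    # EP: each GPU holds whole experts; a token adds work only to the
    # GPUs that own the experts it was routed to.
    ep_load = [0] * num_gpus
    for _ in range(batch_tokens):
        for expert in rng.sample(range(num_experts), top_k):
            ep_load[expert // experts_per_gpu] += 1

    # TP: every expert is sharded across all GPUs, so each GPU does an
    # equal 1/num_gpus share of every token-to-expert assignment.
    total_assignments = batch_tokens * top_k
    tp_load = [total_assignments / num_gpus] * num_gpus
    return ep_load, tp_load


def imbalance(load):
    """Ratio of the busiest GPU's load to the mean load (1.0 = balanced)."""
    mean = sum(load) / len(load)
    return max(load) / mean


if __name__ == "__main__":
    for batch in (1, 8, 64, 4096):
        ep, tp = per_gpu_load(batch)
        print(f"batch={batch:5d}  EP imbalance={imbalance(ep):.2f}  "
              f"TP imbalance={imbalance(tp):.2f}")
```

In this model a single token touches at most 4 of the 8 GPUs under EP, so the busiest GPU carries at least twice the mean load, while TP stays perfectly balanced by construction. As the batch grows, the EP imbalance shrinks toward 1, which matches the expectation that the two strategies converge at higher batch sizes.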