dbrx
What's the optimal parallel strategy using TensorRT-LLM?
First, thanks for your great work. I read the PR you opened in the TensorRT-LLM repo and noticed that EP + TP, PP + TP, and TP are supported for inference. Which one is optimal? Specifically, for the MoE layer, does EP or TP yield better performance?
cc: @megha95
TP is better: at lower batch sizes it gives better load balance across GPUs. At higher batch sizes, the two should perform similarly. We haven't benchmarked EP yet.
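
To illustrate the load-balance argument, here is a rough toy simulation, not a benchmark. It assumes 16 experts with top-4 routing, 8 GPUs, and uniformly random routing (a real learned router will behave differently), and it ignores communication cost entirely. The function names are made up for this sketch and are not part of TensorRT-LLM. Under EP each GPU owns whole experts, so its work depends on which experts the tokens happen to pick; under TP every expert is sliced across all GPUs, so each GPU gets the same share by construction.

```python
import random


def ep_load_imbalance(num_tokens, num_experts=16, top_k=4, num_gpus=8, seed=0):
    """Max GPU load divided by mean GPU load under expert parallelism.

    Assumption: each token picks top_k distinct experts uniformly at random.
    A value of 1.0 means perfectly balanced; larger means the busiest GPU
    does that many times the average work.
    """
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus
    load = [0] * num_gpus
    for _ in range(num_tokens):
        for expert in rng.sample(range(num_experts), top_k):
            # Under EP, the GPU that owns the expert does all of its work.
            load[expert // experts_per_gpu] += 1
    mean_load = sum(load) / num_gpus
    return max(load) / mean_load


def tp_load_imbalance(num_tokens, num_gpus=8):
    """Under TP each expert's weights are split across all GPUs, so every GPU
    does an equal slice of every expert call, regardless of routing."""
    return 1.0


if __name__ == "__main__":
    for batch in (1, 8, 64, 4096):
        print(batch, round(ep_load_imbalance(batch), 2), tp_load_imbalance(batch))
```

With a handful of tokens, some GPUs under EP sit idle while others do double duty, so the ratio is well above 1. As the batch grows, the random assignments average out and the ratio approaches 1, which matches the expectation that the two strategies become similar at higher batch sizes.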