
How does auto_parallel work?

Open Hudayday opened this issue 1 year ago • 1 comment

Hi, I just noticed that auto_parallel support is available. However, after reviewing the code, I still don't understand how it determines the best configuration. It seems to model the resharding cost graph as an ILP problem and solve for the lowest-cost configuration.

Is this correct? Does it apply to all models?

Hudayday avatar Apr 19 '24 04:04 Hudayday

Yes, the AutoPP implementation is based on Alpa. You can find more details about how to model auto parallelization as an ILP problem in the Alpa paper.
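To make the idea concrete, here is a tiny self-contained sketch of the objective that such an ILP minimizes: each op picks one sharding strategy, and the total cost is per-op compute cost plus resharding (communication) cost on each edge where adjacent layouts disagree. The op names, strategies, and cost numbers below are all hypothetical, and a brute-force search stands in for the ILP solver; it is not TensorRT-LLM's actual implementation.

```python
from itertools import product

# Hypothetical per-op compute cost for each candidate sharding strategy.
strategies = {
    "matmul_1": {"row": 4.0, "col": 5.0},
    "gelu":     {"row": 1.0, "col": 1.0},
    "matmul_2": {"row": 5.0, "col": 4.0},
}
# Dataflow edges of the toy graph.
edges = [("matmul_1", "gelu"), ("gelu", "matmul_2")]

def reshard_cost(src_layout, dst_layout):
    # Hypothetical rule: resharding is free when layouts match,
    # otherwise it costs a fixed communication penalty.
    return 0.0 if src_layout == dst_layout else 3.0

def best_assignment():
    # Enumerate every strategy combination and keep the cheapest one.
    # (The real formulation solves this selection as an ILP instead.)
    ops = list(strategies)
    best, best_cost = None, float("inf")
    for choice in product(*(strategies[op] for op in ops)):
        assign = dict(zip(ops, choice))
        cost = sum(strategies[op][s] for op, s in assign.items())
        cost += sum(reshard_cost(assign[a], assign[b]) for a, b in edges)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

assign, cost = best_assignment()
```

In this toy instance the resharding penalty makes any mixed-layout plan more expensive than keeping one layout end to end, which is exactly the kind of trade-off the cost graph encodes.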

Does it apply to all models?

You can find all supported native ops in LAYER_TYPE_2_NODE_TYPE and all supported plugins in PLUGIN_LAYER_TYPE_2_NODE_TYPE in tensorrt_llm/auto_parallel/node_graph.py. Any model that only uses layers in this range can enable auto parallel. For example, LLaMA and GPT are supported, but Mixtral is unsupported since the MoE plugin is not included. We will add cost models for more plugins in future versions.
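The eligibility rule above can be sketched as a simple membership check: a model can enable auto parallel only if every layer type it uses appears in one of the two supported tables. The table contents and the helper below are hypothetical stand-ins for illustration; the real tables live in tensorrt_llm/auto_parallel/node_graph.py as LAYER_TYPE_2_NODE_TYPE and PLUGIN_LAYER_TYPE_2_NODE_TYPE.

```python
# Hypothetical subsets standing in for the real support tables.
NATIVE_SUPPORTED = {"MATRIX_MULTIPLY", "ELEMENTWISE", "SOFTMAX", "ACTIVATION"}
PLUGIN_SUPPORTED = {"GPT_ATTENTION_PLUGIN", "GEMM_PLUGIN"}

def can_auto_parallel(layer_types):
    """Return (eligible, unsupported_layer_types) for a list of layer types."""
    supported = NATIVE_SUPPORTED | PLUGIN_SUPPORTED
    unsupported = [t for t in layer_types if t not in supported]
    return (not unsupported), unsupported

# A GPT-style model whose layers are all in the tables is eligible;
# a model using an unlisted plugin (e.g. an MoE plugin) is not.
ok_gpt, _ = can_auto_parallel(
    ["MATRIX_MULTIPLY", "GPT_ATTENTION_PLUGIN", "ACTIVATION"])
ok_moe, missing = can_auto_parallel(["MATRIX_MULTIPLY", "MOE_PLUGIN"])
```

This mirrors why Mixtral is rejected in the answer above: one unlisted plugin type is enough to make the whole model ineligible.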

yuxianq avatar Apr 19 '24 09:04 yuxianq