TensorRT-LLM chore: Refactor disaggregated serving scripts

To simplify disaggregated serving deployment and reduce duplicated code, the disaggregated workers and server can now be launched with:

python3 ${EXAMPLE_DIR}/launch_disaggregated_workers.py -c ${CONFIG_FILE} 
trtllm-serve disaggregated -c ${CONFIG_FILE}

respectively, instead of

mpirun --allow-run-as-root -n ${NUM_RANKS} python3 ${EXAMPLE_DIR}/launch_disaggregated_workers.py -c ${CONFIG_FILE} 
python3 ${EXAMPLE_DIR}/launch_disaggregated_server.py -c ${CONFIG_FILE}

The number of MPI ranks is now determined automatically from the config file, so the user no longer needs to calculate the total and pass it to mpirun.
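
As a rough illustration of how a rank count could be derived from a config, here is a minimal sketch. It is not the code in this PR: the function name `count_mpi_ranks` and the keys `context_servers`, `generation_servers`, `num_instances`, `tensor_parallel_size`, and `pipeline_parallel_size` are assumptions made for the example, and the real schema may differ.

```python
# Hypothetical config layout; the schema actually read by the scripts may differ.
def count_mpi_ranks(config: dict) -> int:
    """Sum the ranks needed by every worker instance described in the config."""
    total = 0
    for role in ("context_servers", "generation_servers"):
        section = config.get(role, {})
        instances = section.get("num_instances", 0)
        tp = section.get("tensor_parallel_size", 1)
        pp = section.get("pipeline_parallel_size", 1)
        # Each instance is assumed to occupy tp * pp ranks.
        total += instances * tp * pp
    return total


example_config = {
    "context_servers": {"num_instances": 2, "tensor_parallel_size": 2},
    "generation_servers": {"num_instances": 1, "tensor_parallel_size": 4},
}
print(count_mpi_ranks(example_config))  # 2*2*1 + 1*4*1 = 8
```

Under these assumptions, the value that previously had to be passed by hand as ${NUM_RANKS} would be 8 for this example.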

In addition, launch_disaggregated_server.py duplicated code that already exists in trtllm-serve, so the disaggregated server is now launched with trtllm-serve disaggregated instead.

Mar 25 '25 17:03 pcastonguay

/bot run

Mar 25 '25 18:03 pcastonguay

PR_Github #464 [ run ] triggered by Bot

Mar 25 '25 18:03 niukuo

/bot run --disable-fail-fast

Mar 25 '25 20:03 pcastonguay

PR_Github #468 [ run ] triggered by Bot

Mar 25 '25 20:03 niukuo

PR_Github #464 [ run ] completed with state ABORTED /LLM/main/L0_MergeRequest_PR pipeline #397 completed with status: 'FAILURE'

Mar 25 '25 20:03 niukuo

PR_Github #468 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #401 completed with status: 'FAILURE'

Mar 26 '25 00:03 niukuo

/bot run

Mar 26 '25 01:03 pcastonguay

PR_Github #488 [ run ] triggered by Bot

Mar 26 '25 01:03 niukuo

PR_Github #488 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #420 completed with status: 'FAILURE'

Mar 26 '25 03:03 niukuo

/bot run --disable-fail-fast

Mar 26 '25 14:03 pcastonguay

PR_Github #593 [ run ] triggered by Bot

Mar 26 '25 14:03 niukuo

PR_Github #593 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #503 completed with status: 'SUCCESS'

Mar 26 '25 16:03 niukuo

/bot help

Mar 26 '25 17:03 pcastonguay

How do we run disagg with Slurm once the PR has merged?

Mar 28 '25 03:03 chuangz0

/bot run --only-multi-gpu-test

Mar 28 '25 03:03 chuangz0

PR_Github #668 [ run ] triggered by Bot

Mar 28 '25 03:03 tensorrt-cicd

/bot run multi-gpu-test

Mar 28 '25 03:03 chuangz0

PR_Github #670 Bot args parsing error!

Mar 28 '25 04:03 tensorrt-cicd

PR_Github #668 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #562 (Partly Tested) completed with status: 'FAILURE'

Mar 28 '25 07:03 tensorrt-cicd

/bot run --disable-fail-fast

Mar 31 '25 16:03 pcastonguay

PR_Github #794 [ run ] triggered by Bot

Mar 31 '25 16:03 tensorrt-cicd

PR_Github #794 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #642 completed with status: 'FAILURE'

Mar 31 '25 20:03 tensorrt-cicd

/bot run --disable-fail-fast

Apr 01 '25 18:04 pcastonguay

PR_Github #922 [ run ] triggered by Bot

Apr 01 '25 19:04 tensorrt-cicd

PR_Github #922 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #728 completed with status: 'SUCCESS'

Apr 01 '25 21:04 tensorrt-cicd

/bot run --multi-gpu-test

Apr 02 '25 13:04 pcastonguay

/bot run --add-multi-gpu-test

Apr 02 '25 14:04 pcastonguay

PR_Github #1023 [ run ] triggered by Bot

Apr 02 '25 14:04 tensorrt-cicd

PR_Github #1024 [ run ] triggered by Bot

Apr 02 '25 14:04 tensorrt-cicd

PR_Github #1023 [ run ] completed with state ABORTED

Apr 02 '25 14:04 tensorrt-cicd