Enable GPTOSS GB200 DISAGG
This MR enables disaggregation for GPTOSS on GB200.
Modified files to add GPTOSS to Disagg runners and workflow.
Successful tests here: https://github.com/InferenceMAX/InferenceMAX/actions/runs/19353241086/job/55369372877
Thanks for this contribution @jgangani
Can you explain what this means? Are all of the datapoints just 4 GPUs for prefill only and then 4 GPUs for decode only? If not, can you explain the parallelism config and the concurrency for each datapoint?
./submit_disagg.sh mtp=off tp 1 1 1 512 20000 "0.9" 0 0 "128 256 512"
./submit_disagg.sh mtp=off tp 1 1 2 1024 20000 "0.9" 0 0 "64 128 256"
./submit_disagg.sh mtp=off tep 1 1 2 1024 20000 "0.9" 0 0 "64 256"
./submit_disagg.sh mtp=off tp 1 1 4 2048 20000 "0.9" 0 0 "8 16 32 64 128"
./submit_disagg.sh mtp=off tp 1 1 8 2048 20000 "0.9" 0 0 "1 2 4 8 16"
Also @jgangani, please merge this into the main branch or release candidate instead of using a side branch: https://github.com/ai-dynamo/dynamo/compare/release/0.5.1-rc0.20251105...jthomson04/gpt-oss-disagg-slurm
Following is the order:
Yes, that was the goal. I wanted to test out the MR before merging this into the release branch. Will update.
@jgangani thanks! Can you please enable 1k/8k and 1k/1k on GPTOSS GB200 in this PR too? Thanks!
@functionstackx Switched to dynamo release branch.
I am working on 1k1k DISAGG Pareto configs next. 1k8k DISAGG will probably be on par with AGG since it is predominantly doing just decode. Hence, I recommend we merge this MR first. Does that make sense?
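For reference, the "predominantly decode" point can be put in rough numbers. This is a minimal sketch, assuming 1k and 8k mean 1024 and 8192 tokens and using the share of generated tokens per request as a crude proxy for how much of the work is decode; it is not a measurement, and the function name is made up for illustration.

```python
# Hypothetical helper: fraction of a request's tokens that are generated (decode)
# rather than read as prompt (prefill). Token share is only a rough proxy for
# where time is spent; it ignores the very different per-token cost of each phase.
def decode_token_share(input_len: int, output_len: int) -> float:
    return output_len / (input_len + output_len)


# Assumption: "1k" = 1024 tokens and "8k" = 8192 tokens, written as input/output.
SCENARIOS = {
    "1k/1k": (1024, 1024),
    "1k/8k": (1024, 8192),
    "8k/1k": (8192, 1024),
}

for name, (input_len, output_len) in SCENARIOS.items():
    share = decode_token_share(input_len, output_len)
    print(f"{name}: {share:.1%} of tokens are generated")
```

Under these assumptions about 89% of the tokens in a 1k/8k request are generated, versus about 11% for 8k/1k, which is consistent with expecting less benefit from separating prefill in the 1k/8k case.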
If you can, please submit GB200 AGG for 1k/8k in this PR too.
We're going to hold off on this until #251 gets merged this week.
@jgangani so sorry brother but can you please rebase with main following the convention set forth in https://github.com/InferenceMAX/InferenceMAX/pull/251 ?
Yes, I am working on it. Will open another MR based on main once #251 is merged.
@jgangani hi! where are we on this?
GB200 DISAGG for 8k1k is ready with refactored code. I can create an MR right away if need be. I am still working through 1k1k config exploration and will need a few more days for it.