feat: Pathways use single resource group
Jobs requesting TPU resources may also request CPU and memory. However, when Pathways is enabled, Kueue cannot admit such jobs because there is no CPU or memory quota.
This fix adds a very high CPU and memory quota alongside the TPU/GPU resources and merges the Pathways resource group with the accelerator resource group.
This also allows us to run AXLearn jobs without having to make changes manually.
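For illustration, the merged resource group described above could look roughly like the sketch below. It builds a Kueue `ClusterQueue` manifest as a plain Python dict, with a single resource group covering the accelerator, `cpu`, and `memory`. The queue name, flavor name, quota values, and the `build_cluster_queue` helper are assumptions made for this example and are not taken from the xpk code.

```python
import json

# Assumption: arbitrarily large values so CPU and memory never block admission.
HIGH_CPU_QUOTA = 1_000_000
HIGH_MEMORY_QUOTA = "1000000Gi"


def build_cluster_queue(name, flavor_name, accelerator_resource, accelerator_quota):
    """Return a ClusterQueue manifest with one resource group that covers
    the accelerator together with CPU and memory (hypothetical helper)."""
    resources = [
        {"name": accelerator_resource, "nominalQuota": accelerator_quota},
        {"name": "cpu", "nominalQuota": HIGH_CPU_QUOTA},
        {"name": "memory", "nominalQuota": HIGH_MEMORY_QUOTA},
    ]
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "namespaceSelector": {},  # match all namespaces
            "resourceGroups": [
                {
                    # One group only: every resource a job may request is listed here.
                    "coveredResources": [r["name"] for r in resources],
                    "flavors": [{"name": flavor_name, "resources": resources}],
                }
            ],
        },
    }


# Hypothetical names and quota, chosen only for this example.
manifest = build_cluster_queue("cluster-queue", "tpu-flavor", "google.com/tpu", 16)
print(json.dumps(manifest, indent=2))
```

With a single group, a job that requests `google.com/tpu` together with `cpu` and `memory` finds all three resources under the same flavor, rather than needing a separate CPU-only group.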
Follow-up to https://github.com/AI-Hypercomputer/xpk/pull/574, this time from a branch within the xpk repo.
Seems @lukebaumann encountered an issue when not using create-pathways. A potential fix is to remove the create-pathways command altogether, since it doesn't seem to be needed. We may be able to get rid of the CPU resource flavor, which would also unblock AXLearn jobs.
Seems NAP without pathways is also impacted. I think we need a different fix. See #603
This PR should solve NAP and AXLearn support as well. Would prefer to get this merged and will check with Luke on why it wasn't working for him.
It's still needed for us to be able to run AXLearn on xpk clusters.
I think it's set in https://github.com/AI-Hypercomputer/xpk/blob/f0626b93a0a7d3530e2daafeeb553e9a8168a0a3/src/xpk/core/kueue_manager.py#L324. @samos123 could you verify?