
feat: Pathways use single resource group

Open samos123 opened this issue 5 months ago • 5 comments

Jobs requesting TPU resources may also request CPU and memory. However, when Pathways is enabled, Kueue cannot admit such jobs because there is no CPU or memory quota.

This fix adds a very high CPU and memory quota for TPU/GPU resources and merges the Pathways resource group with the accelerator resource group.
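As a rough illustration of the idea (not the code in this PR), a merged resource group could look like the sketch below. It builds a Kueue `ClusterQueue` manifest in which one resource group covers the accelerator, `cpu`, and `memory`. The flavor name, queue name, chip count, and the `VERY_HIGH_QUOTA` value are hypothetical placeholders; the actual values live in xpk's Kueue setup.

```python
import json

# Hypothetical placeholder for "a very high number"; not the value used in xpk.
VERY_HIGH_QUOTA = 1_000_000


def build_cluster_queue(flavor_name: str, tpu_chips: int) -> dict:
    """Return a ClusterQueue manifest with a single, merged resource group.

    CPU and memory are covered by the same group (and the same flavor) as the
    accelerator, so a job requesting TPU + cpu + memory can be admitted.
    """
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ClusterQueue",
        "metadata": {"name": "cluster-queue"},  # hypothetical queue name
        "spec": {
            "namespaceSelector": {},
            "resourceGroups": [
                {
                    # One group covering the accelerator plus cpu and memory.
                    "coveredResources": ["google.com/tpu", "cpu", "memory"],
                    "flavors": [
                        {
                            "name": flavor_name,
                            # Listed in the same order as coveredResources.
                            "resources": [
                                {"name": "google.com/tpu", "nominalQuota": tpu_chips},
                                {"name": "cpu", "nominalQuota": VERY_HIGH_QUOTA},
                                {"name": "memory", "nominalQuota": f"{VERY_HIGH_QUOTA}Gi"},
                            ],
                        }
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical flavor name and chip count, for illustration only.
    print(json.dumps(build_cluster_queue("tpu-flavor", 16), indent=2))
```

With a single group, there is no separate CPU-only flavor that a TPU job's CPU and memory requests would need to be matched against.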

This also allows us to run AXLearn jobs without having to make changes manually.

Follow-up from https://github.com/AI-Hypercomputer/xpk/pull/574, this time with a branch within the xpk repo.

samos123 avatar Aug 20 '25 19:08 samos123

Seems @lukebaumann encountered an issue when not using create-pathways. A potential fix is to remove the create-pathways command altogether, since it doesn't seem to be needed. We may be able to get rid of the CPU resource flavor, which would also unblock AXLearn jobs.

samos123 avatar Aug 22 '25 18:08 samos123

Seems NAP without Pathways is also impacted. I think we need a different fix. See #603

samos123 avatar Aug 23 '25 02:08 samos123

This PR should solve NAP and AXLearn support as well. Would prefer to get this merged and will check with Luke on why it wasn't working for him.

samos123 avatar Aug 24 '25 00:08 samos123

It's still needed for us to be able to run AXLearn on xpk clusters.

samos123 avatar Oct 28 '25 14:10 samos123

I think it's set in https://github.com/AI-Hypercomputer/xpk/blob/f0626b93a0a7d3530e2daafeeb553e9a8168a0a3/src/xpk/core/kueue_manager.py#L324. @samos123 could you verify?

jamOne- avatar Oct 28 '25 14:10 jamOne-

This pull request is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 28 '25 02:11 github-actions[bot]