Do not override compile-time podTemplate resources with config defaults if they're set
## Tracking issue
https://github.com/flyteorg/flyte/issues/6482
## Why are the changes needed?
Pod template resources for CPU and memory were being overwritten by task resource defaults, while other resources like nvidia.com/gpu and rdma/infiniband were preserved correctly. This broke the expected resource priority order for users relying on PodTemplates.
In our case, we run a multi-cluster Flyte deployment, and our infrastructure is highly heterogeneous in compute, storage, and network. We want our "infra"-related code (such as PodTemplates) to stay separate from the Flyte workflow definitions, and we expect per-environment limits for each step to be set in that "infra" code. This is especially handy when many tasks occupy a node fully regardless of its size (e.g. if a node has 10 CPUs we assign 9, if it has 200 CPUs we assign 199, and so on).
The problem: when users specify resources in a compile-time PodTemplate like this:

```yaml
resources:
  requests:
    cpu: "55"              # ❌ was being overwritten by task defaults
    memory: "1837Gi"       # ❌ was being overwritten by task defaults
    nvidia.com/gpu: "120"  # ✅ was preserved correctly
    rdma/infiniband: "63"  # ✅ was preserved correctly
```
Only GPU and RDMA resources were respected, while CPU and memory were replaced with platform defaults, causing inconsistent behavior and breaking user expectations.
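To illustrate the intended semantics, here is a minimal, self-contained Go sketch. This is not Flyte's actual code: `ResourceList` here is a plain string map standing in for k8s `resource.Quantity` values, and both function names are hypothetical. It contrasts the buggy "always overwrite" application of defaults with the fixed "apply only if missing" behavior:

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for k8s core/v1 ResourceList.
type ResourceList map[string]string

// applyDefaultsOverwriting models the buggy behavior: platform defaults
// clobber whatever the pod template set. CPU and memory were affected
// because defaults exist for them; extended resources like nvidia.com/gpu
// survived only because no defaults exist for those keys.
func applyDefaultsOverwriting(current, defaults ResourceList) {
	for name, qty := range defaults {
		current[name] = qty
	}
}

// applyDefaultsIfMissing models the fixed behavior: a default is applied
// only when the resource is not already set, so pod template values win.
func applyDefaultsIfMissing(current, defaults ResourceList) {
	for name, qty := range defaults {
		if _, ok := current[name]; !ok {
			current[name] = qty
		}
	}
}

func main() {
	fromPodTemplate := ResourceList{"cpu": "55", "memory": "1837Gi", "nvidia.com/gpu": "120"}
	defaults := ResourceList{"cpu": "2", "memory": "4Gi"}
	applyDefaultsIfMissing(fromPodTemplate, defaults)
	fmt.Println(fromPodTemplate["cpu"], fromPodTemplate["memory"], fromPodTemplate["nvidia.com/gpu"])
}
```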
## What changes were proposed in this pull request?
This PR implements proper resource priority handling by fetching pod template resources early and merging them with the correct precedence order:

1. **Early Pod Template Fetching**: Modified `ApplyFlytePodConfiguration` to fetch the base pod template before applying customizations, enabling access to pod template resources during container customization.
2. **Smart Resource Extraction**: Added the `ExtractContainerResourcesFromPodTemplate()` function with intelligent fallback logic: exact container name match → "primary" → "default" → empty.
3. **Enhanced Merging Logic**:
   - Added `AddFlyteCustomizationsToContainerWithPodTemplate()` for pod template-aware resource handling.
   - Added `MergeResourcesIfMissing()` to merge resources only when they are not already set (preserving higher-priority values).
   - Maintained backward compatibility with the existing `AddFlyteCustomizationsToContainer()`.
4. **Correct Priority Implementation**: Resource customization now properly follows the intended priority order:
   1. Resource overrides (highest priority)
   2. Container inline resources
   3. PodTemplate resources ← now correctly preserved!
   4. Platform resource defaults (lowest priority, with limits enforcement)
5. **`GetBasePodTemplate` Exposure**: Made the method public, as it is needed outside `flytek8s` (e.g. in the kfoperators common code).
6. **Respect the Base PodTemplate in kfoperator**: Do not override resources when they already exist in the PodTemplate.
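The container-name fallback in the resource-extraction step above can be sketched as follows. This is a simplified, hypothetical illustration using plain maps rather than Flyte's real k8s core/v1 types, and the function body is an assumption based on the description in this PR, not the actual implementation:

```go
package main

import "fmt"

// Container is a simplified stand-in for k8s core/v1 Container.
type Container struct {
	Name      string
	Resources map[string]string
}

// extractContainerResources sketches the described fallback chain:
// exact container name match → "primary" → "default" → empty.
func extractContainerResources(containers []Container, name string) map[string]string {
	for _, candidate := range []string{name, "primary", "default"} {
		for _, c := range containers {
			if c.Name == candidate {
				return c.Resources
			}
		}
	}
	return map[string]string{}
}

func main() {
	containers := []Container{
		{Name: "default", Resources: map[string]string{"cpu": "1"}},
		{Name: "primary", Resources: map[string]string{"cpu": "55"}},
	}
	// No container named "my-task", so the "primary" entry wins.
	fmt.Println(extractContainerResources(containers, "my-task")["cpu"])
}
```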
The fix also applies to `ResourceCustomizationModeEnsureExistingResourcesInRange` to ensure consistent behavior across all resource customization modes.
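Under my reading of that mode (an assumption based on its name, not a quote of the implementation), "ensure existing resources in range" clamps values the user already set to the platform limits without injecting defaults for unset resources. A minimal Go sketch, with float64 quantities standing in for k8s `resource.Quantity`:

```go
package main

import "fmt"

// ensureInRange clamps each already-set resource to the platform limit,
// but never injects a value for a resource the user did not set, so pod
// template values stay authoritative below the cap.
func ensureInRange(current, limits map[string]float64) {
	for name, qty := range current {
		if limit, ok := limits[name]; ok && qty > limit {
			current[name] = limit
		}
	}
}

func main() {
	current := map[string]float64{"cpu": 199}
	ensureInRange(current, map[string]float64{"cpu": 200, "memory": 64})
	// cpu stays at 199 (within the limit); memory is not injected.
	fmt.Println(current["cpu"], len(current))
}
```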
## How was this patch tested?
### Unit Tests
- All existing tests pass (zero regressions).
- Added a comprehensive test, `TestAddFlyteCustomizationsToContainerWithPodTemplate`, covering the exact user scenario.
- Added `TestExtractContainerResourcesFromPodTemplate`, validating the extraction logic with all fallback cases.
- Tested resource priority handling with platform limits enforcement.
### Manual Testing
Verified the fix on a development k8s cluster: it resolves the exact scenario from the issue, and pod template CPU and memory resources are now preserved alongside GPU and RDMA resources for both a kfp-plugin task and a raw task.
P.S. I am not extremely familiar with the Flyte codebase, so I'll be happy to adjust the changes if I missed something. Thank you, folks!
Summary by Bito
This pull request enhances resource management in Flyte by ensuring user-defined CPU and memory settings in pod templates are preserved and not overridden by default values. It introduces new functions for resource extraction and merging, improving reliability in heterogeneous environments. Comprehensive tests have been added to validate these changes and ensure backward compatibility.
Codecov Report
Attention: Patch coverage is 75.53191% with 23 lines in your changes missing coverage. Please review.
Project coverage is 58.68%. Comparing base (`aaf6fec`) to head (`0329bad`). Report is 19 commits behind head on master.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #6483      +/-   ##
==========================================
+ Coverage   58.50%   58.68%   +0.17%
==========================================
  Files         940      938       -2
  Lines       71582    71466     -116
==========================================
+ Hits        41878    41938      +60
+ Misses      26525    26342     -183
- Partials     3179     3186       +7
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests-datacatalog | 59.03% <ø> (ø) | |
| unittests-flyteadmin | 56.22% <ø> (-0.01%) | :arrow_down: |
| unittests-flytecopilot | 39.56% <ø> (+8.56%) | :arrow_up: |
| unittests-flytectl | 64.72% <ø> (ø) | |
| unittests-flyteidl | 76.12% <ø> (ø) | |
| unittests-flyteplugins | 61.16% <75.53%> (+0.08%) | :arrow_up: |
| unittests-flytepropeller | 54.83% <ø> (+0.05%) | :arrow_up: |
| unittests-flytestdlib | 64.04% <ø> (+0.01%) | :arrow_up: |
Could you please provide some result screenshots showing that this PR is working as expected, if possible?
Thanks!
Hey @machichima, thanks for the thorough review! I've addressed most of the issues and re-tested in my development environment; the PR still serves its purpose after all the suggested changes.
The failed CI checks are due to rate limits; the rest of the tests pass with no issues.
Please take another look.
Hey @davidmirror-ops, thanks for looking into it! I've applied your small typo suggestion; please approve!