flyte icon indicating copy to clipboard operation
flyte copied to clipboard

Do not override compile-time podTemplate resources with config defaults if they're set

Open punkerpunker opened this issue 8 months ago • 2 comments

Tracking issue

https://github.com/flyteorg/flyte/issues/6482

Why are the changes needed?

Pod template resources for CPU and memory were being overwritten by task resource defaults, while other resources like nvidia.com/gpu and rdma/infiniband were preserved correctly. This broke the expected resource priority order for users relying on PodTemplates.

In our case, we're using a multi-cluster deployment of Flyte, and our infrastructure is very heterogeneous in terms of resources, storage and network. We want to make sure our "infra" related code (such as PodTemplate) is separated from the Flyte workflow definitions, and we expect that limits for each env for the steps are set within "infra" code. This is especially handy where we have a lot of tasks that are occupying a node fully, regardless of the amount of resources there are (e.g in case there is 10 CPUs in nodes - we assign 9 CPUs, in case there is 200 CPUs - we assign 199, and so on).

The problem: When users specify resources in compile-time PodTemplates like this:

resources:
  requests:
    cpu: "55"              # ❌ Was being overwritten by task defaults
    memory: "1837Gi"       # ❌ Was being overwritten by task defaults  
    nvidia.com/gpu: "120"  # ✅ Was preserved correctly
    rdma/infiniband: "63"  # ✅ Was preserved correctly

Only GPU and RDMA resources were respected, while CPU and memory were replaced with platform defaults, causing inconsistent behavior and breaking user expectations.

What changes were proposed in this pull request?

This PR implements proper resource priority handling by fetching pod template resources early and merging them with the correct precedence order:

  1. Early Pod Template Fetching: Modified ApplyFlytePodConfiguration to fetch the base pod template before applying customizations, enabling access to pod template resources during container customization.

  2. Smart Resource Extraction: Added ExtractContainerResourcesFromPodTemplate() function with intelligent fallback logic:

    • Exact container name match → "primary" → "default" → empty
  3. Enhanced Merging Logic:

    • Added AddFlyteCustomizationsToContainerWithPodTemplate() for pod template-aware resource handling
    • Added MergeResourcesIfMissing() to merge resources only when not already set (preserving higher priority)
    • Maintained backward compatibility with existing AddFlyteCustomizationsToContainer()
  4. Correct Priority Implementation: Now properly follows the intended resource priority order:

    1. Resource overrides (highest priority)
    2. Container inline resources
    3. PodTemplate resourcesNow correctly preserved!
    4. Platform resource defaults (lowest priority, with limits enforcement)
  5. GetBasePodTemplate method exposure to public, as we need it in places outside flytek8s (e.g in kfoperators common)

  6. Respect base PodTemplate in kfoperator - do not override resources in case they exists in the PodTemplate

The fix also applies to ResourceCustomizationModeEnsureExistingResourcesInRange mode to ensure consistent behavior across all resource customization modes.

How was this patch tested?

Unit Tests

  • All existing tests pass (zero regression)
  • Added comprehensive test TestAddFlyteCustomizationsToContainerWithPodTemplate covering the exact user scenario
  • Added test TestExtractContainerResourcesFromPodTemplate validating extraction logic with all fallback cases
  • Tested resource priority handling with platform limits enforcement

Manual Testing

Verified the fix on devel k8s cluster - resolves the exact scenario from the issue where pod template CPU and memory resources are now preserved alongside GPU and RDMA resources for both, using a kfp plugin and a raw task.

P.S

I am not extremely familiar with Flyte codebase, so will be happy to adjust changes in case I missed something, thank you folks!

Summary by Bito

This pull request enhances resource management in Flyte by ensuring user-defined CPU and memory settings in pod templates are preserved and not overridden by default values. It introduces new functions for resource extraction and merging, improving reliability in heterogeneous environments. Comprehensive tests have been added to validate these changes and ensure backward compatibility.

punkerpunker avatar Jun 03 '25 14:06 punkerpunker

Codecov Report

Attention: Patch coverage is 75.53191% with 23 lines in your changes missing coverage. Please review.

Project coverage is 58.68%. Comparing base (aaf6fec) to head (0329bad). Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
...ns/go/tasks/pluginmachinery/flytek8s/pod_helper.go 11.11% 13 Missing and 3 partials :warning:
...tasks/pluginmachinery/flytek8s/container_helper.go 92.00% 4 Missing and 2 partials :warning:
.../plugins/k8s/kfoperators/common/common_operator.go 0.00% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6483      +/-   ##
==========================================
+ Coverage   58.50%   58.68%   +0.17%     
==========================================
  Files         940      938       -2     
  Lines       71582    71466     -116     
==========================================
+ Hits        41878    41938      +60     
+ Misses      26525    26342     -183     
- Partials     3179     3186       +7     
Flag Coverage Δ
unittests-datacatalog 59.03% <ø> (ø)
unittests-flyteadmin 56.22% <ø> (-0.01%) :arrow_down:
unittests-flytecopilot 39.56% <ø> (+8.56%) :arrow_up:
unittests-flytectl 64.72% <ø> (ø)
unittests-flyteidl 76.12% <ø> (ø)
unittests-flyteplugins 61.16% <75.53%> (+0.08%) :arrow_up:
unittests-flytepropeller 54.83% <ø> (+0.05%) :arrow_up:
unittests-flytestdlib 64.04% <ø> (+0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 03 '25 15:06 codecov[bot]

Could you please provide some result screenshots showing that this PR is working as expected if possible?

Thanks!

machichima avatar Jun 07 '25 00:06 machichima

Hey @machichima, thanks for a thorough review, I've adjusted most of the issues and tested in my development environment - the PR still serves its purpose after all suggested changes, thanks!

Failed tests are due to rate limits, the rest of the tests are seemingly passing with no issues

Please take another look.

punkerpunker avatar Jun 23 '25 23:06 punkerpunker

Hey @davidmirror-ops, thanks for looking into it! I've fixed your small typo suggestion, please approve!

punkerpunker avatar Jul 01 '25 15:07 punkerpunker