[BUG] Resources from compile-time PodTemplate are ignored when no inline or overrides are specified (replaced with config defaults)

Open punkerpunker opened this issue 8 months ago • 1 comments

Bug Description

When a Kubernetes PodTemplate with resource specifications is used as a compile-time PodTemplate, the CPU and memory values defined in it are overwritten by task_resource_defaults, while extended resources such as nvidia.com/gpu and rdma/infiniband are preserved correctly.

Expected Behavior

Pod template resources should have higher priority than platform defaults but lower priority than explicit resource overrides, following this priority order:

  1. Resource overrides (highest priority)
  2. Container inline resources
  3. PodTemplate resources ← This should be preserved
  4. Platform/task resource defaults (lowest priority)
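The priority order above can be sketched as a per-resource merge in which a higher-priority layer wins whenever it sets a value. This is only an illustration of the intended precedence, not Flyte's actual implementation: the helper name merge_by_priority and the platform default values are hypothetical, and the PodTemplate values mirror the example in this issue.

```python
from typing import Dict, Optional

Resources = Dict[str, str]


def merge_by_priority(*layers: Optional[Resources]) -> Resources:
    """Merge resource maps; layers are given from highest to lowest priority.

    Hypothetical helper: a resource name keeps the value from the first
    (highest-priority) layer that defines it.
    """
    merged: Resources = {}
    for layer in layers:
        for name, quantity in (layer or {}).items():
            # setdefault only writes when no higher-priority layer set the key.
            merged.setdefault(name, quantity)
    return merged


overrides = None  # 1. no resource overrides
inline = None     # 2. no container inline resources
pod_template = {  # 3. PodTemplate resources
    "cpu": "55",
    "memory": "1837Gi",
    "nvidia.com/gpu": "120",
    "rdma/infiniband": "63",
}
platform_defaults = {"cpu": "2", "memory": "4Gi"}  # 4. assumed default values

effective = merge_by_priority(overrides, inline, pod_template, platform_defaults)
print(effective)
```

With this ordering, the platform defaults only fill in resources that no higher-priority layer has set, so the PodTemplate's cpu and memory survive alongside the extended resources.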

Current Behavior

CPU and memory from PodTemplate are being overwritten by task resource defaults, while GPU and RDMA resources work correctly.

Example

Given this PodTemplate resources section:

resources:
  requests:
    cpu: "55"
    memory: "1837Gi"
    nvidia.com/gpu: "120"
    rdma/infiniband: "63"
  limits:
    cpu: "55"
    memory: "1837Gi"
    nvidia.com/gpu: "120"
    rdma/infiniband: "63"

Current Result:

  • nvidia.com/gpu and rdma/infiniband are preserved
  • cpu and memory are overwritten by task defaults

Expected Result:

  • ✅ All resources from PodTemplate should be preserved when not explicitly overridden

Root Cause

The issue is in the resource merging logic where PodTemplate resources are not being considered during the container resource customization phase in AddFlyteCustomizationsToContainer.

Proposed Solution

  1. Fetch base pod template early in ApplyFlytePodConfiguration
  2. Extract container resources from the pod template
  3. Pass these resources to an enhanced customization function
  4. Implement proper resource priority merging with the correct precedence order
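A rough sketch of these four steps follows, using plain dictionaries in place of the real Kubernetes and Flyte types. Everything here is an assumption made for illustration: the function names extract_container_resources and customize_resources are hypothetical, the template is modelled as a dict shaped like a Kubernetes PodTemplate, the container is assumed to be named "primary", and the platform default values are invented. It does not reproduce the code in ApplyFlytePodConfiguration or AddFlyteCustomizationsToContainer.

```python
from typing import Dict, Mapping, Optional

# {"requests": {...}, "limits": {...}}
ResourceSpec = Dict[str, Dict[str, str]]


def extract_container_resources(pod_template: Mapping, container_name: str = "primary") -> ResourceSpec:
    """Step 2: pull requests/limits for one container out of the base template."""
    containers = pod_template.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        if container.get("name") == container_name:
            resources = container.get("resources", {})
            return {
                "requests": dict(resources.get("requests", {})),
                "limits": dict(resources.get("limits", {})),
            }
    return {"requests": {}, "limits": {}}


def customize_resources(
    overrides: Optional[ResourceSpec],
    inline: Optional[ResourceSpec],
    template: Optional[ResourceSpec],
    defaults: Optional[ResourceSpec],
) -> ResourceSpec:
    """Steps 3-4: merge layers, highest priority first, for requests and limits."""
    result: ResourceSpec = {"requests": {}, "limits": {}}
    for layer in (overrides, inline, template, defaults):
        for kind in ("requests", "limits"):
            for name, quantity in (layer or {}).get(kind, {}).items():
                # Keep the value from the first layer that defines this resource.
                result[kind].setdefault(name, quantity)
    return result


# Step 1: the base pod template is fetched early (here: a literal dict).
base_template = {
    "template": {
        "spec": {
            "containers": [
                {
                    "name": "primary",
                    "resources": {
                        "requests": {"cpu": "55", "memory": "1837Gi", "nvidia.com/gpu": "120"},
                        "limits": {"cpu": "55", "memory": "1837Gi", "nvidia.com/gpu": "120"},
                    },
                }
            ]
        }
    }
}
assumed_defaults = {
    "requests": {"cpu": "2", "memory": "4Gi"},
    "limits": {"cpu": "2", "memory": "4Gi"},
}

template_resources = extract_container_resources(base_template)
final = customize_resources(None, None, template_resources, assumed_defaults)
print(final["requests"])
```

Passing the extracted template resources as their own layer, rather than letting defaults be applied to an empty container spec first, is what keeps the template's cpu and memory from being treated as unset.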

Impact

In our case, we run a multi-cluster Flyte deployment on infrastructure that is very heterogeneous in terms of resources, storage, and network. We want to keep our infrastructure-related configuration (such as PodTemplates) separate from the Flyte workflow definitions. This is especially convenient because many of our tasks occupy a node fully, regardless of how many resources the node has (e.g., on nodes with 10 CPUs we assign 9 CPUs, on nodes with 200 CPUs we assign 199, and so on).

This affects users who rely on PodTemplates to specify consistent resource requirements across their Flyte tasks, particularly those using specialized resources alongside standard CPU/memory requirements.

Additional context to reproduce

  1. Set up a task that specifies pod_template_name, with resources.limits and resources.requests defined in the referenced PodTemplate.
  2. Upon pod creation, the PodTemplate's resources are treated as unset and are overwritten by the platform (or config, if unset) defaults.

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [x] Yes

Have you read the Code of Conduct?

  • [x] Yes

punkerpunker avatar Jun 03 '25 12:06 punkerpunker

It seems likely that Flyte currently does not preserve CPU and memory resource values from runtime PodTemplates if no explicit inline or override resources are set; instead, these are overwritten by task resource defaults, while extended resources like nvidia.com/gpu and rdma/infiniband are preserved. This behavior is due to the resource merging logic in AddFlyteCustomizationsToContainer, which does not treat PodTemplate resources with the intended priority: platform defaults are applied after PodTemplate values for CPU and memory, but not for extended resources. This is consistent with the current implementation and known issues discussed in the Flyte community and codebase, but there is no explicit documentation or code reference stating that PodTemplate CPU/memory should always take precedence over platform defaults unless overridden by inline or workflow-level overrides.

If you want PodTemplate resources to have higher priority than platform defaults (but lower than explicit overrides), this would require a change in the resource merging logic, as described in your proposed solution. This issue is not fully addressed in the current documentation or code, and the described behavior is a known limitation for users relying on PodTemplates for resource specification in heterogeneous environments. Would you like more detail on the current merging logic or guidance on possible workarounds?


runllm[bot] avatar Jun 03 '25 12:06 runllm[bot]

Resolved with https://github.com/flyteorg/flyte/pull/6483

punkerpunker avatar Jul 02 '25 00:07 punkerpunker