[BUG] Resources from compile-time PodTemplate are ignored when no inline resources or overrides are specified (replaced with config defaults)
Bug Description
When a task references a compile-time Kubernetes PodTemplate that specifies resources, the CPU and memory values defined in the PodTemplate are overwritten by `task_resource_defaults`, while extended resources such as `nvidia.com/gpu` and `rdma/infiniband` are preserved correctly.
Expected Behavior
Pod template resources should have higher priority than platform defaults but lower priority than explicit resource overrides, following this priority order (a code sketch of the merge appears after the list):
- Resource overrides (highest priority)
- Container inline resources
- PodTemplate resources ← This should be preserved
- Platform/task resource defaults (lowest priority)
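A minimal sketch of this precedence as a per-resource merge over Kubernetes resource lists. The helper `mergeResourceList` and its call order are illustrative assumptions, not Flyte's actual code:

```go
package resourcemerge

import (
	v1 "k8s.io/api/core/v1"
)

// mergeResourceList is a hypothetical helper illustrating the intended
// precedence. Sources are passed highest-priority first; lower-priority
// sources are applied first so higher-priority sources overwrite them.
func mergeResourceList(sources ...v1.ResourceList) v1.ResourceList {
	merged := v1.ResourceList{}
	for i := len(sources) - 1; i >= 0; i-- {
		for name, quantity := range sources[i] {
			merged[name] = quantity
		}
	}
	return merged
}

// Intended call order:
//   mergeResourceList(overrides, inlineResources, podTemplateResources, platformDefaults)
// With the buggy behavior, cpu/memory from podTemplateResources lose to
// platformDefaults; extended resources survive only because the defaults
// never define them.
```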
Current Behavior
CPU and memory from PodTemplate are being overwritten by task resource defaults, while GPU and RDMA resources work correctly.
Example
Given this PodTemplate resources section:
```yaml
resources:
  requests:
    cpu: "55"
    memory: "1837Gi"
    nvidia.com/gpu: "120"
    rdma/infiniband: "63"
  limits:
    cpu: "55"
    memory: "1837Gi"
    nvidia.com/gpu: "120"
    rdma/infiniband: "63"
```
Current Result:
- ✅ `nvidia.com/gpu` and `rdma/infiniband` are preserved
- ❌ `cpu` and `memory` are overwritten by task defaults
Expected Result:
- ✅ All resources from PodTemplate should be preserved when not explicitly overridden
Root Cause
The issue lies in the resource merging logic: PodTemplate resources are not considered during the container resource customization phase in `AddFlyteCustomizationsToContainer`, so CPU and memory appear unset there and are replaced with the configured defaults.
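A simplified illustration of this failure mode (not the actual Flyte source; `applyDefaults` below is a reduced stand-in for the default-application step inside the customization phase):

```go
package resourcemerge

import v1 "k8s.io/api/core/v1"

// applyDefaults stands in for the customization step: it only sees the
// container spec. When cpu/memory live in the PodTemplate rather than
// inline on the container, they are absent here, so platform defaults
// fill them in. Extended resources (nvidia.com/gpu, rdma/infiniband)
// appear unset too, but the defaults don't define them, so the
// PodTemplate's values survive the later pod-spec merge.
func applyDefaults(container *v1.Container, defaults v1.ResourceList) {
	if container.Resources.Requests == nil {
		container.Resources.Requests = v1.ResourceList{}
	}
	for name, quantity := range defaults {
		if _, set := container.Resources.Requests[name]; !set {
			// Fires for cpu/memory even though the PodTemplate set them,
			// because the PodTemplate isn't visible at this point.
			container.Resources.Requests[name] = quantity
		}
	}
}
```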
Proposed Solution
- Fetch the base pod template early in `ApplyFlytePodConfiguration`
- Extract container resources from the pod template
- Pass these resources to an enhanced customization function
- Implement proper resource priority merging with the correct precedence order (see the sketch below)
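A sketch of what the enhanced customization could look like. The name `customizeWithPodTemplate` and its signature are hypothetical assumptions, not the actual API of the fix:

```go
package resourcemerge

import v1 "k8s.io/api/core/v1"

// customizeWithPodTemplate is a hypothetical enhanced customization: the
// PodTemplate's container resources are passed in explicitly so the merge
// can honor the full precedence chain instead of falling straight through
// to platform defaults.
func customizeWithPodTemplate(
	container *v1.Container,
	overrides, podTemplateResources, platformDefaults v1.ResourceList,
) {
	merged := v1.ResourceList{}
	// Apply lowest priority first; later sources overwrite earlier ones.
	for _, src := range []v1.ResourceList{
		platformDefaults,             // lowest priority
		podTemplateResources,         // PodTemplate values
		container.Resources.Requests, // container inline resources
		overrides,                    // explicit overrides (highest)
	} {
		for name, quantity := range src {
			merged[name] = quantity
		}
	}
	container.Resources.Requests = merged
}
```

Because the merge is per resource name, `nvidia.com/gpu` and `rdma/infiniband` keep working exactly as before, while `cpu` and `memory` now fall back to the PodTemplate before reaching the platform defaults.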
Impact
In our case, we're running a multi-cluster deployment of Flyte, and our infrastructure is highly heterogeneous in terms of resources, storage, and network. We want to keep our "infra"-related code (such as PodTemplates) separate from the Flyte workflow definitions. This is especially convenient because we have many tasks that occupy a node fully, regardless of its size (e.g., if a node has 10 CPUs we assign 9, if it has 200 CPUs we assign 199, and so on).
This affects users who rely on PodTemplates to specify consistent resource requirements across their Flyte tasks, particularly those using specialized resources alongside standard CPU/memory requirements.
Additional context to reproduce
- Set up a task specifying `pod_template_name`, with `resources.limits` and `resources.requests` defined in the referenced PodTemplate.
- Upon pod creation, those resources are treated as unset and are overwritten by platform (or config, if unset) defaults.
Screenshots
No response
Are you sure this issue hasn't been raised already?
- [x] Yes
Have you read the Code of Conduct?
- [x] Yes
It seems likely that Flyte currently does not preserve CPU and memory resource values from compile-time PodTemplates if no explicit inline or override resources are set; instead, these are overwritten by task resource defaults, while extended resources like `nvidia.com/gpu` and `rdma/infiniband` are preserved. This behavior stems from the resource merging logic in `AddFlyteCustomizationsToContainer`, which does not treat PodTemplate resources with the intended priority: platform defaults are applied over PodTemplate values for CPU and memory, but not for extended resources. This is consistent with the current implementation and with known issues discussed in the Flyte community and codebase, but there is no explicit documentation or code reference stating that PodTemplate CPU/memory should always take precedence over platform defaults unless overridden by inline or workflow-level overrides.
If you want PodTemplate resources to have higher priority than platform defaults (but lower than explicit overrides), this would require a change to the resource merging logic, as described in your proposed solution. This issue is not fully addressed in the current documentation or code, and the described behavior is a known limitation for users who rely on PodTemplates for resource specification in heterogeneous environments. Would you like more detail on the current merging logic or guidance on possible workarounds?
Resolved with https://github.com/flyteorg/flyte/pull/6483