[BUG] unset CPU limit is overridden with CPU request

Open flixr opened this issue 2 years ago • 8 comments

Describe the bug

A container task with only a CPU request and no CPU limit set still gets a CPU limit applied:

from flytekit import ContainerTask, Resources

hello_task = ContainerTask(
    name="hello",
    image="ubuntu:20.04",
    requests=Resources(cpu="2", mem="1Gi"),
    limits=Resources(mem="2Gi"),  # memory limit only; no CPU limit specified
    command=["echo", "hello"],
)

Running this task results in a pod whose CPU limit is also set to 2 (the same as the request), but there should be no CPU limit.

Expected behavior

If no CPU limit is specified, none should be set implicitly, and no CPU limit should be applied in k8s.

So I propose copying the requests to the limits as a whole only when the limits are completely unset, instead of filling each missing limit with the corresponding request value.
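
To make the proposal concrete, here is a small sketch in plain Python (illustrative only, not the actual flytekit/flytepropeller code; Res is a stand-in for whatever structure carries the per-resource values):

from dataclasses import dataclass
from typing import Optional

# stand-in for the real resource structure; not flytekit's Resources
@dataclass
class Res:
    cpu: Optional[str] = None
    mem: Optional[str] = None

def current_behavior(requests: Res, limits: Res) -> Res:
    # today: each missing limit is back-filled with the matching request
    return Res(cpu=limits.cpu or requests.cpu,
               mem=limits.mem or requests.mem)

def proposed_behavior(requests: Res, limits: Res) -> Res:
    # proposal: copy requests to limits only when limits are completely unset;
    # a partially specified limits block is left exactly as the user wrote it
    if limits.cpu is None and limits.mem is None:
        return Res(cpu=requests.cpu, mem=requests.mem)
    return limits

requests, limits = Res(cpu="2", mem="1Gi"), Res(mem="2Gi")
print(current_behavior(requests, limits))   # Res(cpu='2', mem='2Gi')  <- CPU limit appears
print(proposed_behavior(requests, limits))  # Res(cpu=None, mem='2Gi') <- no CPU limit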

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

flixr · Apr 06 '23 12:04

We just ran into this today as well; in our case the request is set at the task level.

Here is the schema from flytectl's point of view (flytectl get workflow -p flytesnacks -d development test_workflow -o json --version test_version):

... snipped to the section of interest...
          "resources": {
            "requests": [
              {
                "name": "CPU",
                "value": "64"
              },
              {
                "name": "MEMORY",
                "value": "950Gi"
              }
            ]
          }

pod spec:

resources:
  limits:
    cpu: "64"
    memory: 950Gi
  requests:
    cpu: "64"
    memory: 950Gi

ossareh · Apr 11 '23 16:04

For more context: in many cases setting CPU limits (as opposed to requests) results in unused CPU due to throttling; see also https://home.robusta.dev/blog/stop-using-cpu-limits.

flixr · May 04 '23 18:05

Need to explore this further, but Flyte does this because of k8s pod QoS. If the requests differ from the limits, the pod does not get the Guaranteed QoS class and k8s is more likely to preempt it to schedule others. By setting them the same, k8s will not prematurely delete the pod to schedule another.
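
For reference, the QoS class k8s assigns can be read back from the pod status with the kubernetes Python client (a sketch; the pod name and namespace below are placeholders for an actual Flyte task pod). A pod whose containers all have requests equal to limits is classed as Guaranteed and is evicted last under node pressure; otherwise it falls into Burstable.

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# placeholders: substitute the name/namespace of a real Flyte task pod
pod = v1.read_namespaced_pod(name="some-flyte-task-pod", namespace="flytesnacks-development")

res = pod.spec.containers[0].resources
print("requests:", res.requests)          # e.g. {'cpu': '2', 'memory': '1Gi'}
print("limits:  ", res.limits)            # today the CPU limit is back-filled from the request
print("QoS:     ", pod.status.qos_class)  # 'Guaranteed' only if requests == limits for every container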

hamersaw · Dec 01 '23 22:12

We were wondering if it is the right decision to set it by default instead of letting the user decide.

In our case most tasks are scheduled on exclusive nodes. Without manually overriding the limits, users currently have to set their requests very close to the node's maximum CPU and memory capacity; otherwise their pod is throttled and OOM-killed earlier than necessary, even though the node has spare capacity.

We currently work around this by automatically overriding the limits on every task.
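
Roughly, the workaround looks like this (a sketch with hypothetical per-cluster constants, not our exact code): wrap @task so every task gets explicit limits pinned to the node's allocatable capacity, so Flyte never back-fills them from the requests.

import functools
from flytekit import task, Resources

# hypothetical constants: the allocatable capacity of our exclusive nodes
NODE_CPU = "64"
NODE_MEM = "950Gi"

def node_sized_task(fn=None, **task_kwargs):
    """Like @task, but always sets explicit limits so Flyte does not back-fill them."""
    if fn is None:
        return functools.partial(node_sized_task, **task_kwargs)
    task_kwargs.setdefault("limits", Resources(cpu=NODE_CPU, mem=NODE_MEM))
    return task(**task_kwargs)(fn)

@node_sized_task(requests=Resources(cpu="2", mem="1Gi"))
def hello() -> str:
    return "hello"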

fellhorn · Dec 05 '23 15:12

> We were wondering if it is the right decision to set it by default instead of letting the user decide.

My two cents: letting the user decide, rather than enforcing an unintuitive behavior, is never a bad choice.

Sure, there can be a default behavior, but the user should have the option to leave the limits unset (unlimited) depending on the use case. And, as @flixr pointed out, there are sometimes good reasons to do so.

leoll2 · Jul 08 '24 11:07

"Hello 👋, this issue has been inactive for over 90 days. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏"

github-actions[bot] · May 15 '25 00:05

Bump, still a very annoying issue that prevents us from fully utilizing our cluster!

flixr · May 15 '25 06:05

We are thinking of removing this limit. There is a noisy-neighbor problem that can occur and then cause non-deterministic failures. Cc @EngHabu

kumare3 · May 15 '25 13:05

"Hello 👋, this issue has been inactive for over 90 days. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏"

github-actions[bot] · Aug 15 '25 00:08

Hello 👋, this issue has been inactive for over 90 days and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] · Aug 23 '25 00:08