[Core feature] Allow different resource configs to ray worker and head
Motivation: Why do you think this is important?
Currently, Ray workers and the head share the same pod template, so they are launched with identical pod resources at runtime. However, in some cases users want GPUs only on the worker nodes, not on the head. In other cases, users want to create two worker groups: one with CPUs and another with GPUs.
Goal: What should the final outcome look like, ideally?
Users can pass different configs to the ray worker and head.
For example:
```python
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        requests=Resources(mem="64Gi", cpu="4"),
        limits=Resources(mem="64Gi", cpu="4"),
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="cpu-group",
            replicas=4,
            requests=Resources(mem="256Gi", cpu="64"),
            limits=Resources(mem="256Gi", cpu="64"),
        ),
        WorkerNodeConfig(
            group_name="gpu-group",
            replicas=2,
            requests=Resources(mem="480Gi", cpu="60", gpu="2"),
            limits=Resources(mem="480Gi", cpu="60", gpu="2"),
        ),
    ],
)
```
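For illustration, the proposed shape can be sketched with plain dataclasses. This is a minimal, self-contained sketch of the API above, not the actual flytekit implementation; class and field names follow the example, but defaults and helper methods (like `total_replicas`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of the proposed config classes; the real
# flytekit classes may differ in defaults and behavior.
@dataclass
class Resources:
    mem: Optional[str] = None
    cpu: Optional[str] = None
    gpu: Optional[str] = None

@dataclass
class HeadNodeConfig:
    requests: Optional[Resources] = None
    limits: Optional[Resources] = None

@dataclass
class WorkerNodeConfig:
    group_name: str
    replicas: int
    requests: Optional[Resources] = None
    limits: Optional[Resources] = None

@dataclass
class RayJobConfig:
    head_node_config: Optional[HeadNodeConfig] = None
    worker_node_config: List[WorkerNodeConfig] = field(default_factory=list)

    def total_replicas(self) -> int:
        # Sum replicas across all worker groups (head not included).
        return sum(w.replicas for w in self.worker_node_config)

cfg = RayJobConfig(
    head_node_config=HeadNodeConfig(requests=Resources(mem="64Gi", cpu="4")),
    worker_node_config=[
        WorkerNodeConfig(group_name="cpu-group", replicas=4,
                         requests=Resources(mem="256Gi", cpu="64")),
        WorkerNodeConfig(group_name="gpu-group", replicas=2,
                         requests=Resources(mem="480Gi", cpu="60", gpu="2")),
    ],
)
print(cfg.total_replicas())  # → 6
```

Each `WorkerNodeConfig` would map to one worker group in the underlying RayCluster spec, so the per-group `requests`/`limits` translate directly into that group's pod resources.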
Describe alternatives you've considered
No alternatives considered.
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
This looks like a reasonable feature. Thanks for raising it, @ByronHsu !
This would be a great first issue to work on.
@troychiu or I will take this on.
This is relevant to other plugins that follow a driver-worker pattern (e.g. ray, spark, kfoperator).
FYI: I created an additional issue which is closely related to this one: https://github.com/flyteorg/flyte/issues/4674
It extends the idea by also allowing pod specifications.
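To illustrate what the pod-specification extension could look like, here is a hypothetical sketch: a `pod_template` field on the worker group config holding a raw Kubernetes pod spec fragment. The field name, the dict-based representation, and the selector/toleration values are all illustrative assumptions, not the proposal in the linked issue.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: a per-node-group pod spec hook in the spirit of
# flyteorg/flyte#4674. Class and field names are hypothetical.
@dataclass
class WorkerNodeConfig:
    group_name: str
    replicas: int
    # A raw Kubernetes pod spec fragment (as a dict) that would be
    # merged over the plugin's generated pod for this worker group.
    pod_template: Optional[dict] = None

gpu_group = WorkerNodeConfig(
    group_name="gpu-group",
    replicas=2,
    pod_template={
        "spec": {
            # Example overrides that plain requests/limits cannot express:
            "nodeSelector": {"accelerator": "nvidia-a100"},
            "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
        }
    },
)
print(gpu_group.pod_template["spec"]["nodeSelector"])
```

The point of the extension is exactly this: node selectors, tolerations, sidecars, and similar pod-level settings go beyond what `requests`/`limits` alone can express.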
@ByronHsu Any updates? What do you think about my extension?
We'll regroup and re-prioritize this feature in the coming sprints.