[Core feature] Allow different resource configs to ray worker and head
Motivation: Why do you think this is important?
Currently, Ray workers and the head share the same pod template, so they are launched with identical pod resources at runtime. However, in some cases users want GPUs only on the worker nodes, not on the head. In other cases, users want to create two worker groups: one with CPUs and another with GPUs.
Goal: What should the final outcome look like, ideally?
Users can pass different configs to the ray worker and head.
For example:
```python
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        requests=Resources(mem="64Gi", cpu="4"),
        limits=Resources(mem="64Gi", cpu="4"),
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="cpu-group",
            replicas=4,
            requests=Resources(mem="256Gi", cpu="64"),
            limits=Resources(mem="256Gi", cpu="64"),
        ),
        WorkerNodeConfig(
            group_name="gpu-group",
            replicas=2,
            requests=Resources(mem="480Gi", cpu="60", gpu="2"),
            limits=Resources(mem="480Gi", cpu="60", gpu="2"),
        ),
    ],
)
```
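For illustration, the proposed shape can be sketched with plain dataclasses. This is a minimal, self-contained sketch of the API above, not the actual flytekit implementation; class and field names follow the example, but defaults and helper methods (like `total_replicas`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of the proposed config classes; the real
# flytekit classes may differ in defaults and behavior.
@dataclass
class Resources:
    mem: Optional[str] = None
    cpu: Optional[str] = None
    gpu: Optional[str] = None

@dataclass
class HeadNodeConfig:
    requests: Optional[Resources] = None
    limits: Optional[Resources] = None

@dataclass
class WorkerNodeConfig:
    group_name: str
    replicas: int
    requests: Optional[Resources] = None
    limits: Optional[Resources] = None

@dataclass
class RayJobConfig:
    head_node_config: Optional[HeadNodeConfig] = None
    worker_node_config: List[WorkerNodeConfig] = field(default_factory=list)

    def total_replicas(self) -> int:
        # Sum replicas across all worker groups (head not included).
        return sum(w.replicas for w in self.worker_node_config)

cfg = RayJobConfig(
    head_node_config=HeadNodeConfig(requests=Resources(mem="64Gi", cpu="4")),
    worker_node_config=[
        WorkerNodeConfig(group_name="cpu-group", replicas=4,
                         requests=Resources(mem="256Gi", cpu="64")),
        WorkerNodeConfig(group_name="gpu-group", replicas=2,
                         requests=Resources(mem="480Gi", cpu="60", gpu="2")),
    ],
)
print(cfg.total_replicas())  # → 6
```

Each `WorkerNodeConfig` would map to one worker group in the underlying RayCluster spec, so the per-group `requests`/`limits` translate directly into that group's pod resources.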
Describe alternatives you've considered
No alternatives considered.
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
This looks like a reasonable feature. Thanks for raising it, @ByronHsu !
This would be a great first issue to work on.
@troychiu or I will take this on.
This is relevant to other plugins that follow a driver-worker pattern (e.g. ray, spark, kfoperator).
FYI: I created an additional issue which is closely related to this one: https://github.com/flyteorg/flyte/issues/4674
It extends the idea by also allowing pod specifications.
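To illustrate what the pod-specification extension could look like, here is a hypothetical sketch: a `pod_template` field on the worker group config holding a raw Kubernetes pod spec fragment. The field name, the dict-based representation, and the selector/toleration values are all illustrative assumptions, not the proposal in the linked issue.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: a per-node-group pod spec hook in the spirit of
# flyteorg/flyte#4674. Class and field names are hypothetical.
@dataclass
class WorkerNodeConfig:
    group_name: str
    replicas: int
    # A raw Kubernetes pod spec fragment (as a dict) that would be
    # merged over the plugin's generated pod for this worker group.
    pod_template: Optional[dict] = None

gpu_group = WorkerNodeConfig(
    group_name="gpu-group",
    replicas=2,
    pod_template={
        "spec": {
            # Example overrides that plain requests/limits cannot express:
            "nodeSelector": {"accelerator": "nvidia-a100"},
            "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
        }
    },
)
print(gpu_group.pod_template["spec"]["nodeSelector"])
```

The point of the extension is exactly this: node selectors, tolerations, sidecars, and similar pod-level settings go beyond what `requests`/`limits` alone can express.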
@ByronHsu Any updates? What do you think about my extension?
We'll regroup and re-prioritize this feature in the coming sprints.