[Bug] KubeRay Operator pod fails to start when using --enable-metrics with helm chart v1.3.2

Open cmontemuino opened this issue 11 months ago • 7 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

The KubeRay Operator pod fails to start when --enable-metrics is included in the container argument list.

The flag comes from the following lines in the chart's deployment.yaml template:

https://github.com/ray-project/kuberay/blob/bc2e2c6bb0363ae17a32e4f3a3afb0dd2555c573/helm-chart/kuberay-operator/templates/deployment.yaml#L108-L110

Reproduction script

Example arguments:

- args:
    - >-
      --feature-gates=RayClusterStatusConditions=true,RayJobDeletionPolicy=false
    - '--enable-leader-election=true'
    - '--enable-metrics=true'

Pod fails to start:

flag provided but not defined: -enable-metrics
Usage of /manager:
  -batch-scheduler string
    	Batch scheduler name, supported values are volcano and yunikorn.
  -config string
    	Path to structured config file. Flags are ignored if config file is set.
  -enable-batch-scheduler
    	(Deprecated) Enable batch scheduler. Currently is volcano, which supports gang scheduler policy. Please use --batch-scheduler instead.
  -enable-leader-election
    	Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager. (default true)
  -feature-gates string
    	A set of key=value pairs that describe feature gates. E.g. FeatureOne=true,FeatureTwo=false,...
  -forced-cluster-upgrade
    	(Deprecated) Forced cluster upgrade flag
  -health-probe-bind-address string
    	The address the probe endpoint binds to. (default ":8082")
  -kubeconfig string
    	Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-election-namespace string
    	Namespace where the leader election resource lives. Defaults to the pod namespace if not set.
  -log-file-encoder string
    	Encoder to use for log file. Valid values are 'json' and 'console'. Defaults to 'json' (default "json")
  -log-file-path string
    	Synchronize logs to local file
  -log-stdout-encoder string
    	Encoder to use for logging stdout. Valid values are 'json' and 'console'. Defaults to 'json' (default "json")
  -metrics-addr string
    	The address the metric endpoint binds to. (default ":8080")
  -reconcile-concurrency int
    	max concurrency for reconciling (default 1)
  -use-kubernetes-proxy
    	Use Kubernetes proxy subresource when connecting to the Ray Head node.
  -watch-namespace string
    	Specify a list of namespaces to watch for custom resources, separated by commas. If left empty, all namespaces will be watched.
  -zap-devel
    	Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
  -zap-encoder value
    	Zap log encoding (one of 'json' or 'console')
  -zap-log-level value
    	Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
  -zap-stacktrace-level value
    	Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
  -zap-time-encoding value
    	Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

cmontemuino avatar May 22 '25 07:05 cmontemuino

Thank you for reporting. Let me take a look.

troychiu avatar May 22 '25 17:05 troychiu

It's weird that the flag support was not checked in to 1.3.2. In this PR, I added both the helm chart and the operator flag support. However, when I checked the 1.3.2 tag, only the helm chart update is there, not the operator side. @kevin85421 do you see any potential issue with the 1.3.2 release?

troychiu avatar May 25 '25 03:05 troychiu

Sorry, I was wrong. It looks like helm chart v1.3.2 also does not have this flag. Can you confirm which version of the helm chart you were using? Thank you!

troychiu avatar May 25 '25 03:05 troychiu

I'm using v1.3.2

cmontemuino avatar May 26 '25 09:05 cmontemuino

I think you need to follow the steps in https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#run-the-operator-inside-the-cluster to run the latest version of the operator and helm chart, because helm chart v1.3.2 does not support the --enable-metrics flag.

This flag is supported in v1.4.0 or later.

win5923 avatar May 28 '25 03:05 win5923

But the helm chart should still work rather than fail to start.

Let me try it out with v1.3.2.

troychiu avatar May 28 '25 07:05 troychiu

Hi @cmontemuino, I just installed the KubeRay operator with helm chart v1.3.2 and things work fine. Below are the commands I used.

kind create cluster --image=kindest/node:v1.26.0
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.2

There is no --enable-metrics in the deployment's container args:

spec:
    containers:
    - args:
      - --feature-gates=RayClusterStatusConditions=true,RayJobDeletionPolicy=false
      - --enable-leader-election=true

Do you mind sharing your steps to reproduce the issue? Thank you!

troychiu avatar May 29 '25 06:05 troychiu