[Feature Request] Add hostNetwork mode for dcgmExporter
Hello, NVIDIA Team.
I'm facing an issue while configuring dcgm-exporter through gpu-operator. I have two Kubernetes clusters: one where GPU jobs run, and another used for managing the first. Prometheus is not installed on the GPU cluster, to keep CPU and memory usage there as low as possible, so I'm trying to collect metrics using Prometheus from the other cluster.
I would like to enable hostNetwork for dcgm-exporter in order to get metrics from each node, but I can't find where to set it in the gpu-operator Helm chart. (As I recall, this is useful when Prometheus is deployed outside of the Kubernetes cluster.)
I found that hostNetwork is configurable in dcgm-exporter's own Helm chart, for example:
    spec:
      {{- if .Values.runtimeClassName }}
      runtimeClassName: {{ .Values.runtimeClassName }}
      {{- end }}
      priorityClassName: {{ .Values.priorityClassName | default "system-node-critical" }}
      {{- if .Values.hostNetwork }}
      hostNetwork: {{ .Values.hostNetwork }}
https://github.com/NVIDIA/dcgm-exporter/blob/4cc1d199cd3b13b6edee96af5339708f9747f499/deployment/templates/daemonset.yaml#L53
But in gpu-operator, only the values below are configurable, and the Service can't be modified here:
dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.3.8-3.6.0-ubuntu22.04
  imagePullPolicy: IfNotPresent
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  resources: {}
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/deployments/gpu-operator/values.yaml#L309C1-L328C20
Besides, there is no configurable section for this in the DaemonSet:
https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/assets/state-dcgm-exporter/0900_daemonset.yaml
So in this case, could you please add a hostNetwork option to the dcgmExporter section?
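To make the request concrete, here is a rough sketch of what such an option might look like. The `hostNetwork` key under `dcgmExporter` is hypothetical (it does not exist in the values file linked above); the second document is only a fragment showing the standard Kubernetes pod-spec fields the operator would need to render into the DaemonSet.

```yaml
# Hypothetical values.yaml override: `hostNetwork` is NOT an existing
# dcgmExporter key in the chart revision linked above.
dcgmExporter:
  enabled: true
  hostNetwork: true        # hypothetical key, name chosen for illustration
---
# Fragment of the DaemonSet pod template that such an option would
# need to produce. Both fields are standard Kubernetes pod-spec fields.
spec:
  template:
    spec:
      hostNetwork: true                    # pod shares the node's network namespace
      dnsPolicy: ClusterFirstWithHostNet   # keeps cluster DNS working for hostNetwork pods
```

With this in place, the exporter would listen on the node's own IP at the port configured through DCGM_EXPORTER_LISTEN (":9400" above).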
Thanks.
Instead of enabling hostNetwork, would making the dcgm-exporter service a NodePort unblock you? If so, we can look into making the dcgm service configurable.
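For comparison, a NodePort Service for the exporter might look roughly like the sketch below. The Service name, namespace, pod label, and nodePort value are all assumptions made for illustration, not values taken from gpu-operator; only port 9400 comes from the DCGM_EXPORTER_LISTEN setting quoted earlier.

```yaml
# Sketch only: name, namespace, selector label, and nodePort are assumed.
apiVersion: v1
kind: Service
metadata:
  name: nvidia-dcgm-exporter        # assumed Service name
  namespace: gpu-operator           # assumed namespace
spec:
  type: NodePort
  # Route traffic only to pods on the node that received the request,
  # so each node answers with its own exporter's metrics.
  externalTrafficPolicy: Local
  selector:
    app: nvidia-dcgm-exporter       # assumed pod label
  ports:
    - name: metrics
      port: 9400                    # matches DCGM_EXPORTER_LISTEN ":9400"
      targetPort: 9400
      nodePort: 30940               # arbitrary pick from the default 30000-32767 range
```

Without `externalTrafficPolicy: Local`, a request to one node's port may be forwarded to an exporter pod on a different node, which matters when scraping per-node GPU metrics.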
I don't think NodePort is suitable for every scenario, as not all Kubernetes clusters necessarily install or use kube-proxy. It would be more compatible if hostNetwork mode were available.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
We have a similar need where a prometheus-server
- runs outside the k8s cluster running gpu-operator (and dcgm-exporter daemonset etc.)
- wants to scrape metrics from every node (running this daemonset pod) of this k8s cluster

IIUC,
- a NodePort service would only target one node at any time, so prometheus-server won't use that service to scrape (prometheus-server would use a role: node scrape config to discover the nodes),
- in this case, a NodePort service would only be useful for the "side-effect" of exposing the node, wasting the service object.
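As an illustration of the role: node discovery mentioned above, an external Prometheus could be configured roughly as follows. This is a sketch that assumes the exporter is reachable on each node's internal IP at port 9400 (for example via hostNetwork); the job name, API server URL, and credential file paths are placeholders.

```yaml
scrape_configs:
  - job_name: dcgm-exporter                 # placeholder job name
    kubernetes_sd_configs:
      - role: node
        # Placeholders: how the external Prometheus reaches the GPU
        # cluster's API server to list nodes.
        api_server: https://gpu-cluster.example.com:6443
        authorization:
          credentials_file: /etc/prometheus/gpu-cluster.token
        tls_config:
          ca_file: /etc/prometheus/gpu-cluster-ca.crt
    relabel_configs:
      # role: node targets point at the kubelet port by default;
      # rewrite the address to the exporter port on the node's internal IP.
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        target_label: __address__
        replacement: "${1}:9400"
```

The credentials here are used only for node discovery against the API server; the scrape itself goes directly to each node over plain HTTP on port 9400.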
Here's a PR to support this ability (using hostNetwork approach): https://github.com/NVIDIA/gpu-operator/pull/1962
Hi @tariq1890, when you get a chance, would you mind taking a look at this PR? I also need help figuring out why the CI checks have been stuck pending for almost a day now. Is there something I should be doing to kick them off? (Only the DCO check ran.)