gpu-operator

[Feature Request] Add hostNetwork mode for dcgmExporter

Open jslouisyou opened this issue 1 year ago • 5 comments

Hello, NVIDIA Team.

I'm facing an issue while configuring dcgm-exporter through gpu-operator. I have two Kubernetes clusters: one runs GPU jobs, and the other is used to manage the first. To keep CPU and memory usage on the GPU cluster as low as possible, Prometheus is not installed there, and I'm trying to collect metrics with a Prometheus instance running in the management cluster.

I would like to run dcgm-exporter with hostNetwork so that metrics can be scraped from each node, but I can't find where to set this in the gpu-operator Helm chart. (As I recall, this is also useful when Prometheus is deployed outside the Kubernetes cluster.)

I found that hostNetwork is configurable in the dcgm-exporter chart, for example:

    spec:
      {{- if .Values.runtimeClassName }}
      runtimeClassName: {{ .Values.runtimeClassName }}
      {{- end }}
      priorityClassName: {{ .Values.priorityClassName | default "system-node-critical" }}
      {{- if .Values.hostNetwork }}
      hostNetwork: {{ .Values.hostNetwork }}

https://github.com/NVIDIA/dcgm-exporter/blob/4cc1d199cd3b13b6edee96af5339708f9747f499/deployment/templates/daemonset.yaml#L53

But in gpu-operator, only the values below are configurable, and the Service can't be modified here:

    dcgmExporter:
      enabled: true
      repository: nvcr.io/nvidia/k8s
      image: dcgm-exporter
      version: 3.3.8-3.6.0-ubuntu22.04
      imagePullPolicy: IfNotPresent
      env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
      resources: {}
      serviceMonitor:
        enabled: false
        interval: 15s
        honorLabels: false
        additionalLabels: {}
        relabelings: []

https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/deployments/gpu-operator/values.yaml#L309C1-L328C20

Besides, there is no configurable hostNetwork section in the DaemonSet either: https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/assets/state-dcgm-exporter/0900_daemonset.yaml

So, could you please add a hostNetwork option to the dcgmExporter section?
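To illustrate the request, the option might look something like the sketch below. The `hostNetwork` field under `dcgmExporter` is hypothetical: it does not exist in the gpu-operator values.yaml linked above, and the name is only borrowed from the dcgm-exporter chart.

```yaml
dcgmExporter:
  enabled: true
  # Hypothetical field, not part of the gpu-operator chart referenced above.
  # If supported, the exporter pod would share the node's network namespace,
  # so the listen address below would be reachable on each node's own IP.
  hostNetwork: true
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
```

With something like this in place, an external Prometheus could scrape port 9400 on each node's IP directly, without going through a Service.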

Thanks.

jslouisyou avatar Oct 30 '24 11:10 jslouisyou

Instead of enabling hostNetwork, would making the dcgm-exporter service a NodePort unblock you? If so, we can look into making the dcgm service configurable.

tariq1890 avatar Nov 20 '24 17:11 tariq1890

I don't think NodePort is suitable for every scenario, as not all Kubernetes clusters necessarily install or use kube-proxy. It would be more compatible if a hostNetwork mode were available.

runzhliu avatar Aug 10 '25 12:08 runzhliu

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 09 '25 04:11 github-actions[bot]

We have a similar need, where a prometheus-server

  • runs outside the k8s cluster that runs gpu-operator (and the dcgm-exporter daemonset etc.)
  • wants to scrape metrics from every node of this k8s cluster that runs a pod of this daemonset
    • IIUC, a NodePort service would only target one node at any given time, so prometheus-server won't use that service to scrape (it would use a role: node scrape config to discover the nodes instead),
    • in this case, a NodePort service would only be useful for its "side effect" of exposing the port on each node, which wastes the service object.
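A minimal sketch of the kind of scrape config meant here, assuming dcgm-exporter is reachable on port 9400 of each node's internal IP. The API server URL and the credential file paths are hypothetical placeholders for an out-of-cluster Prometheus:

```yaml
scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: node
        # Hypothetical values: an out-of-cluster Prometheus needs the API
        # server address and credentials of the GPU cluster.
        api_server: https://gpu-cluster.example.com:6443
        tls_config:
          ca_file: /etc/prometheus/gpu-cluster-ca.crt
        authorization:
          credentials_file: /etc/prometheus/gpu-cluster-token
    relabel_configs:
      # role: node targets default to the kubelet port; point each target
      # at the dcgm-exporter port on the node's internal IP instead.
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        target_label: __address__
        replacement: "$1:9400"
```

This discovers every node and scrapes it directly, so no Service object is involved. Nodes without a dcgm-exporter pod would show up as failed targets unless they are filtered out with further relabeling.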

Here's a PR to support this ability (using hostNetwork approach): https://github.com/NVIDIA/gpu-operator/pull/1962

nikhaild avatar Dec 03 '25 07:12 nikhaild

Here's a PR to support this ability (using hostNetwork approach): #1962

Hi @tariq1890, when you get a chance, would you mind taking a look at this PR? I also need help figuring out why the CI checks have been stuck pending for almost a day now. Is there something I should be doing to kick them off? (Only the DCO check ran.)

nikhaild avatar Dec 04 '25 00:12 nikhaild