dcgm_prometheus.py in version 1:2.3.4 fails with AttributeError: 'DcgmPrometheus' object has no attribute 'm_publishFieldIds'
We followed the documentation at https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/integrating-with-dcgm.html#starting-prometheus-client
and it looks like the new version of datacenter-gpu-manager has an issue with this script:
python3 dcgm_prometheus.py -e
Traceback (most recent call last):
  File "dcgm_prometheus.py", line 264, in <module>
    main()
  File "dcgm_prometheus.py", line 257, in main
    prometheus_obj.LogBasicInformation()
  File "dcgm_prometheus.py", line 142, in LogBasicInformation
    for fieldId in self.m_publishFieldIds:
AttributeError: 'DcgmPrometheus' object has no attribute 'm_publishFieldIds'
datacenter-gpu-manager is already installed. Version table: 1:2.3.4 600
I could not find any information about this by searching Google. This version was only updated in Feb 2022, and I guess no one uses this feature for monitoring...
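The traceback above shows LogBasicInformation reading self.m_publishFieldIds on an object where that attribute was never assigned. The following is a minimal sketch of that failure mode, using a hypothetical FieldPublisher class rather than the real DcgmPrometheus code; the field IDs are illustrative values only, and the getattr guard is one possible local workaround, not a confirmed fix for the shipped script.

```python
class FieldPublisher:
    """Hypothetical stand-in; this is NOT the real DcgmPrometheus class."""

    def __init__(self, set_field_ids=True):
        # If this assignment is skipped, the attribute never exists on the instance.
        if set_field_ids:
            self.m_publishFieldIds = [150, 155]  # illustrative field IDs only

    def log_basic_information(self):
        # getattr with a default avoids the AttributeError when the
        # attribute was never set; an empty list means nothing is logged.
        field_ids = getattr(self, "m_publishFieldIds", [])
        return ["Publishing field ID %d" % field_id for field_id in field_ids]


def reproduce_error():
    """Trigger the same kind of AttributeError and return its message."""
    publisher = FieldPublisher(set_field_ids=False)
    try:
        for _ in publisher.m_publishFieldIds:
            pass
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling reproduce_error() returns a message of the same form as the one in the traceback, while log_basic_information() on the same kind of object returns an empty list instead of raising.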
@graywen24,
Unfortunately, dcgm_prometheus.py is not actively supported; it is only an example. The dcgm-exporter project is meant to provide Prometheus metrics and is actively supported.
Thanks, but we don't use a k8s cluster; we only run offline training on a single GPU node. Installing dcgm-exporter would be a very heavy process for the node, while node-exporter does not provide any GPU monitoring metrics.
@graywen24,
dcgm-exporter may work outside of a k8s environment, and in general it is just a small binary written in Go. If DCGM is installed on the machine, you do not need the dcgm-exporter Docker image (just the dcgm-exporter binary), because libdcgm.so will already be on the machine.
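Once a standalone dcgm-exporter binary is running on the node, a quick way to check that GPU metrics are exposed is to read its metrics endpoint. The sketch below is a rough illustration, assuming the exporter listens at http://localhost:9400/metrics (adjust the URL if it is configured differently) and that sample lines carry no trailing timestamp; parse_metrics and fetch_metrics are hypothetical helper names, not part of any NVIDIA tooling.

```python
import urllib.request


def parse_metrics(text, prefix="DCGM_"):
    """Return (series, value) pairs for sample lines whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(prefix):
            continue
        # Assumption: the value is the last space-separated token (no timestamp).
        series, _, value = line.rpartition(" ")
        try:
            samples.append((series, float(value)))
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5):
    """Download the raw metrics text; the default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the exporter running, parse_metrics(fetch_metrics()) should return one entry per exposed DCGM sample; an empty list suggests the exporter is reachable but not publishing DCGM metrics.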