dcgm-exporter README Docker instructions contain incorrect commands and information
The dcgm-exporter README.md has incorrect information about running dcgm-exporter in Docker. There are two major problems with these instructions, which we would appreciate you fixing.
1. In the Docker section, you indicate that we should create a counters CSV file with a specific set of suggested fields. Unfortunately, using that counters file with the most recent version of the dcgm-exporter Docker image (3.3.5-3.4.1) causes a segmentation violation:
   time="2024-04-09T21:14:41Z" level=info msg="Initializing system entities of type: CPU"
   SIGSEGV: segmentation violation
   If I provide no counters.csv file to the docker command, it works fine (for example, using no `-v` argument in the recommended command in your step 2 here).
2. Again in your recommended `docker run` command, you suggest using `-e DCGM_EXPORTER_INTERVAL=3`, which tells dcgm-exporter to read GPU metrics every 3 milliseconds. This is apparently too fast and causes high CPU usage, which I found out when I opened this issue in the dcgm-exporter repository. The default is `-e DCGM_EXPORTER_INTERVAL=30000`, which does not cause high CPU usage on the system.
These two issues make dcgm-exporter unusable when following your suggested commands. Please fix this documentation.
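For reference, the working variant described above (no counters file mount, default interval) can be sketched roughly as follows. This is a pared-down illustration, not the README's command: the image reference is a placeholder, `DCGM_IMAGE` and `RUN` are hypothetical helper variables, and any other flags the README's command passes are left out, so keep whichever of those you need. By default the script only prints the command; set `RUN=1` to execute it.

```shell
#!/bin/sh
# Sketch: start dcgm-exporter without a custom counters.csv mount (no -v)
# and with the default 30000 ms collection interval.
set -eu

# Assumption: placeholder image reference; set DCGM_IMAGE to the tag you use.
DEFAULT_IMAGE='nvcr.io/nvidia/k8s/dcgm-exporter:<tag>'
IMAGE="${DCGM_IMAGE:-$DEFAULT_IMAGE}"

# DCGM_EXPORTER_INTERVAL is in milliseconds; 30000 is the default.
INTERVAL_MS="${DCGM_EXPORTER_INTERVAL:-30000}"

# Build the command as positional parameters so each argument stays intact.
set -- docker run -d --gpus all -p 9400:9400 \
  -e "DCGM_EXPORTER_INTERVAL=${INTERVAL_MS}" \
  --name dcgm-exporter "$IMAGE"

if [ "${RUN:-0}" = "1" ]; then
  "$@"            # actually start the container
else
  echo "$*"       # dry run: print the command only
fi
```

Building the command first and printing it makes it easy to compare against the README's version before running anything on a GPU host.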
@mbacchi Sorry for the delay. We recently addressed point 2 here: https://github.com/DataDog/integrations-core/pull/18658
In internal testing, we saw more stable CPU usage with a higher exporter interval, as you mentioned. However, we wanted to align the interval more closely with how often the Datadog Agent scrapes, to help prevent stale data.
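As a rough illustration of that alignment, the exporter interval can be derived from the scrape interval by converting seconds to milliseconds. The 15-second value below is an assumption (the Agent's usual default check interval), `SCRAPE_INTERVAL_S` is a hypothetical variable name, and the result is not necessarily the value the linked PR settled on.

```shell
#!/bin/sh
# Sketch: derive DCGM_EXPORTER_INTERVAL (milliseconds) from a scrape
# interval given in seconds.
set -eu

# Assumption: 15 s scrape interval; override SCRAPE_INTERVAL_S as needed.
SCRAPE_INTERVAL_S="${SCRAPE_INTERVAL_S:-15}"

# The exporter expects milliseconds, so multiply by 1000.
DCGM_EXPORTER_INTERVAL=$((SCRAPE_INTERVAL_S * 1000))

echo "DCGM_EXPORTER_INTERVAL=${DCGM_EXPORTER_INTERVAL}"
```

The point is only the unit: a value such as `3` means 3 milliseconds, not 3 seconds.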
For point 1: I couldn't replicate this behavior with 3.3.8-3.6.0-ubuntu22.04 or 3.3.5-3.4.0-ubuntu22.04. Both spin up fine for me, and I can't reproduce the segfault you encountered:
time="2024-09-27T18:11:44Z" level=info msg="Initializing system entities of type: CPU"
time="2024-09-27T18:11:44Z" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-27T18:11:44Z" level=info msg="Initializing system entities of type: CPU Core"
time="2024-09-27T18:11:44Z" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
👋 Closing this for now. I fixed the config error, and I couldn't replicate the other point. Please feel free to reopen if needed.