Nathan J. Williams
I agree that it seems like Xid 94 is essentially an application error and should not disable the device. But as a workaround you can tell it to ignore this...
I'm seeing the same thing: momentary spikes in the DCGM_FI_DEV_GPU_UTIL metric to about 800 million (808793649 in one recent capture) for intervals somewhere between 15 and 45 seconds (I...
I ran `dcgmi dmon -e 203 -d 5000` on all of my fleet hosts for the past day, and never saw numbers out of the range of 0-100. During that...
This looks a lot like https://github.com/NVIDIA/gpu-operator/issues/421 - the image you're using has `NVIDIA_VISIBLE_DEVICES=all` set*, and the plugin is respecting that. You may want to follow the directions in that...