[Feature]: Parity With NVIDIA DCGM - Pulse Test
Suggestion Description
To catch reliability issues earlier, NVIDIA DCGM has an advanced test that creates spikes in the current flow on the board to ensure the VRM & PSU can handle fluctuations (which may be caused by applications that are bound by CPU kernel launch, or by kernels that inherently produce high micro-fluctuations).
I have searched all of ROCmValidationSuite's documentation & codebase and haven't found anything related to this:
https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/conceptual/rvs-modules.html
When you get the chance, could you look into implementing this?
cc: @hliuca
Excerpt About Pulse Test from https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#pulse-test-diagnostic
The Pulse Test is part of the new level 4 tests. The pulse test is meant to fluctuate the power usage to create spikes in current flow on the board to ensure that the power supply is fully functional and can handle wide fluctuations in current.
By default, the test runs kernels with high transiency in order to create spikes in the current running to the GPU. Default parameters have been verified to create worst-case scenario failures by measuring with oscilloscopes.
The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced; work across GPU is synchronized to create extra stress on the power supply.
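To make the ask concrete, here is a rough Python sketch of the kind of load pattern the excerpt above describes: a square-wave on/off schedule, swept over several periods and duty cycles, with every phase expressed as an offset from a shared start time so that all GPUs could switch together. This is only an illustration under my own assumptions. `PulsePhase`, `pulse_schedule`, and `sweep` are hypothetical names, the numbers are placeholders rather than validated parameters, and nothing here reflects how DCGM or ROCmValidationSuite actually implements the test. It only builds the schedule and does not launch any GPU kernels.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class PulsePhase:
    """One segment of the load pattern (hypothetical type, for illustration)."""
    start_s: float     # offset from the shared start time, in seconds
    duration_s: float  # how long this segment lasts, in seconds
    load_on: bool      # True = run the heavy kernel, False = idle


def pulse_schedule(period_s: float, duty_cycle: float, cycles: int) -> List[PulsePhase]:
    """Build a square-wave schedule: `cycles` repetitions of on-then-off."""
    if period_s <= 0 or cycles < 0:
        raise ValueError("period_s must be > 0 and cycles must be >= 0")
    if not 0.0 < duty_cycle < 1.0:
        raise ValueError("duty_cycle must be strictly between 0 and 1")
    on_s = period_s * duty_cycle
    off_s = period_s - on_s
    phases: List[PulsePhase] = []
    for i in range(cycles):
        t0 = i * period_s
        phases.append(PulsePhase(t0, on_s, True))           # load step up
        phases.append(PulsePhase(t0 + on_s, off_s, False))  # load step down
    return phases


def sweep(
    periods_s: Sequence[float], duty_cycles: Sequence[float], cycles: int
) -> Iterator[Tuple[float, float, List[PulsePhase]]]:
    """Iterate over every (period, duty cycle) pair, one schedule per pair."""
    for period in periods_s:
        for duty in duty_cycles:
            yield period, duty, pulse_schedule(period, duty, cycles)


if __name__ == "__main__":
    # Placeholder values only; real ones would come from hardware specs.
    for period, duty, phases in sweep([0.01, 0.001], [0.25, 0.5], cycles=2):
        print(f"period={period}s duty={duty} phases={len(phases)}")
```

Each GPU worker would walk the same schedule relative to one agreed start timestamp, which is what would keep the load steps aligned across devices. The platform-specific part (how large a current step is allowed) is deliberately left out of the sketch.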
Operating System
No response
GPU
No response
ROCm Component
No response
@functionstackx internal ticket created, SWDEV-542282, thank you.
@functionstackx we have an update.
Currently, pulse tests (current spike tests) are implemented internally.
Just wanted to be clear on the ask here:
- Pulse tests specific to each platform, per its hardware (max. current) specifications?
Thank you.
@hliuca by "internally", do you mean in a closed-source package? Is there any way to gain access to it, or could it be open sourced?
Yes, it seems like DCGM does it for each platform.
Hi @functionstackx, it is in an internal repo and has not been merged to the external one yet. Let me check if we can release it. Also, I'll pass the info to the dev team. Thank you.