[Feature]: Parity With NVIDIA DCGM - Pulse Test

Open functionstackx opened this issue 7 months ago • 4 comments

Suggestion Description

To catch reliability issues earlier, NVIDIA DCGM has an advanced test that creates spikes in the current flow on the board to verify that the VRM and PSU can handle fluctuations (which may be caused by applications that are bound by CPU kernel launches, or by kernels that inherently have high micro-fluctuation in power draw).

I have searched all of ROCmValidationSuite's documentation and codebase and haven't found anything related to this:

https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/conceptual/rvs-modules.html

When you get the chance, could you look into implementing this?

cc: @hliuca

Excerpt about the Pulse Test from https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#pulse-test-diagnostic:

The Pulse Test is part of the new level 4 tests. The pulse test is meant to fluctuate the power usage to create spikes in current flow on the board to ensure that the power supply is fully functional and can handle wide fluctuations in current.

By default, the test runs kernels with high transiency in order to create spikes in the current running to the GPU. Default parameters have been verified to create worst-case scenario failures by measuring with oscilloscopes.

The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced; work across GPU is synchronized to create extra stress on the power supply.
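To make the idea concrete, here is a rough CPU-only sketch of the synchronized load/idle pattern the excerpt describes. It is not DCGM's or RVS's implementation: `PulseConfig`, `busy_work`, `run_pulse`, and `pulse_all` are hypothetical names, the timing values are arbitrary, and a spin loop stands in for a high-power GPU kernel. The only point it illustrates is that every worker enters and leaves the load phase at the same barrier, so the load steps line up across devices.

```python
import threading
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PulseConfig:
    """Hypothetical parameters; values are arbitrary, not taken from DCGM or RVS."""
    period_s: float = 0.02    # length of one load + idle cycle
    duty_cycle: float = 0.5   # fraction of each period spent under load
    cycles: int = 5           # number of load/idle cycles to run


def busy_work(duration_s: float) -> int:
    """Stand-in for a high-power GPU kernel: spin until the deadline passes."""
    deadline = time.perf_counter() + duration_s
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations


def run_pulse(worker_id: int, cfg: PulseConfig,
              barrier: threading.Barrier, log: list) -> None:
    """One worker (standing in for one GPU) alternating load and idle phases."""
    load_s = cfg.period_s * cfg.duty_cycle
    idle_s = cfg.period_s - load_s
    for cycle in range(cfg.cycles):
        barrier.wait()  # all workers ramp up together
        log.append((worker_id, cycle, "load"))
        busy_work(load_s)
        barrier.wait()  # all workers drop to idle together
        log.append((worker_id, cycle, "idle"))
        time.sleep(idle_s)


def pulse_all(num_workers: int, cfg: PulseConfig) -> list:
    """Run the synchronized pulse pattern and return the phase log."""
    barrier = threading.Barrier(num_workers)
    log: list = []
    threads = [
        threading.Thread(target=run_pulse, args=(i, cfg, barrier, log))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for entry in pulse_all(2, PulseConfig(cycles=2)):
        print(entry)
```

A real test would presumably launch GPU kernels instead of `busy_work` and tune the period and duty cycle per platform, which is the part that needs hardware-specific knowledge.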

Operating System

No response

GPU

No response

ROCm Component

No response

functionstackx avatar Jul 08 '25 00:07 functionstackx

@functionstackx internal ticket created, SWDEV-542282, thank you.

hliuca avatar Jul 08 '25 16:07 hliuca

@functionstackx we have an update.

Currently, pulse tests (current spike tests) are implemented internally.

Just wanted to be clear on the ask here:

  1. Are you asking for pulse tests specific to each platform, based on its hardware (max. current) specifications?

Thank you.

hliuca avatar Jul 28 '25 20:07 hliuca

@hliuca by "internally", do you mean in a closed-source package? Is there any way to gain access to it, or could it be open sourced?

Yes, it seems like DCGM does it for each platform.

functionstackx avatar Jul 28 '25 20:07 functionstackx

Hi @functionstackx, it is in an internal repo and has not been merged to the external one yet. Let me check whether we can release it. I will also pass the info to the dev team. Thank you.

hliuca avatar Jul 28 '25 20:07 hliuca