
[ROCM-SMI] Does not recognize my CDNA GPU (MI300)

Open tjk213 opened this issue 1 year ago • 5 comments

Does nvtop 3.1.0 support the MI300? rocm-smi recognizes the cards, but nvtop says "No GPU to monitor."

build ❯ ./src/nvtop -v
nvtop version 3.1.0
build ❯ ./src/nvtop
No GPU to monitor.
build ❯ rocm-smi


============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
==========================================================================================================================
0       2     0x74a1,   32700  39.0°C      134.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
1       3     0x74a1,   3884   41.0°C      139.0W    NPS1, SPX, 0        131Mhz  900Mhz  0%   auto  750.0W  0%     0%
2       4     0x74a1,   29122  37.0°C      130.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
3       5     0x74a1,   35464  43.0°C      139.0W    NPS1, SPX, 0        131Mhz  900Mhz  0%   auto  750.0W  0%     0%
4       6     0x74a1,   46166  37.0°C      133.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  10%    0%
5       7     0x74a1,   64654  41.0°C      141.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
6       8     0x74a1,   4769   42.0°C      136.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
7       9     0x74a1,   6315   35.0°C      129.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================

tjk213 avatar Dec 12 '24 21:12 tjk213

Unfortunately, nvtop doesn't support the ROCm SMI interface yet. For now, only the AMD GPUs that are exposed through the DRM kernel interface will be displayed by nvtop.

It seems that the MI300 is mostly geared towards "compute" workloads such as ML and probably has no use as a graphical GPU.
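
A quick way to see which GPUs the kernel exposes through DRM is to walk sysfs. This is only a minimal sketch, assuming the usual Linux layout under /sys/class/drm; the function name and the SYSFS_DRM override are made up for illustration and are not part of nvtop.

```shell
#!/bin/sh
# List DRM card nodes from sysfs with the kernel driver bound to each one.
# SYSFS_DRM is a hypothetical override so the function can be pointed at a test tree.
list_drm_cards() {
    root="${SYSFS_DRM:-/sys/class/drm}"
    for card in "$root"/card[0-9]*; do
        # Skip connector entries such as card0-DP-1.
        case "${card##*/}" in *-*) continue ;; esac
        # Also skips the literal pattern when the glob matched nothing.
        [ -e "$card/device" ] || continue
        driver="unknown"
        if [ -L "$card/device/driver" ]; then
            driver=$(basename "$(readlink "$card/device/driver")")
        fi
        device_id="unknown"
        [ -r "$card/device/device" ] && device_id=$(cat "$card/device/device")
        printf '%s driver=%s device=%s\n' "${card##*/}" "$driver" "$device_id"
    done
}

list_drm_cards
```

If a card shows up here with driver=amdgpu (the device ID should match the DID column from rocm-smi, e.g. 0x74a1), the kernel side is probably in place and the problem is more likely elsewhere, such as device node permissions.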

Syllo avatar Dec 13 '24 19:12 Syllo

It does support the MI250X, though, which is the same class as the MI300X (but an older generation). Wouldn't adding the PCI IDs to the amdgpu_ids.h file get it recognized? When the amdgpu driver is installed, the DRM kernel interface can also interrogate the MI300X. I will see if I can test this.

kbuggenhout avatar Jan 11 '25 22:01 kbuggenhout

Looks like this was an environment/setup problem rather than an nvtop issue. I added my user to the video and host-render groups, and it works now.
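
For anyone checking the same thing, here is a rough sketch of how to verify that the current shell can actually open the DRM device nodes. It assumes the nodes live under /dev/dri and that GNU stat is available; the function name and the DRI_DIR override are hypothetical and only there to make it testable.

```shell
#!/bin/sh
# Report whether the current process can read and write each DRM device node.
# DRI_DIR is a hypothetical override used only to make the function testable.
check_dri_access() {
    dir="${DRI_DIR:-/dev/dri}"
    for node in "$dir"/card* "$dir"/renderD*; do
        # Skips the literal pattern when the glob matched nothing.
        [ -e "$node" ] || continue
        if [ -r "$node" ] && [ -w "$node" ]; then
            echo "${node##*/}: accessible"
        else
            echo "${node##*/}: NOT accessible (owner/group: $(stat -c '%U:%G' "$node"))"
        fi
    done
}

# Groups of the running shell; a fresh `usermod -aG` only shows up after a new login.
id -nG
check_dri_access
```

Note that group changes made with usermod do not apply to sessions that are already open, so log out and back in (or start a new login shell) before re-running nvtop.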

tjk213 avatar Jan 14 '25 21:01 tjk213

Quoting tjk213: "Looks like this was an environment/setup problem rather than an nvtop issue. I added my user to the video and host-render groups, and it works now."

It didn't really work for me.

apt install nvtop # success
sudo usermod -aG video $USER
sudo usermod -aG host-render $USER

nvtop # still hits the no gpu error

Meanwhile, rocm-smi is working correctly.

GindaChen avatar Mar 29 '25 10:03 GindaChen

I started working on rocm-smi library support some time ago; it may make it into the next release!

Syllo avatar Mar 29 '25 11:03 Syllo