Incoherent power settings and behaviour of Vega Frontier Edition
Hi, I was playing with overdrive settings on my Vega Frontier Edition with rocm-dkms-2.0.89-1.x86_64.
I enabled overdrive and used rocm-smi to apply some settings, and by pure luck I'm observing some very strange behaviour. First of all, I'm aware of https://github.com/RadeonOpenCompute/ROCm/issues/463#issuecomment-450698247 and https://github.com/RadeonOpenCompute/ROCm/issues/564#issuecomment-428069668, and these are the commands I prepared based on those recommendations:
rocm-smi --setfan 60%
rocm-smi --autorespond y --setmlevel 2 1100 903
rocm-smi --autorespond y --setmlevel 3 1100 905
rocm-smi --autorespond y --setslevel 2 992 901
rocm-smi --autorespond y --setslevel 3 993 902
rocm-smi --autorespond y --setslevel 4 994 903
rocm-smi --autorespond y --setslevel 5 995 904
rocm-smi --autorespond y --setslevel 6 996 905
rocm-smi --autorespond y --setslevel 7 1000 906
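For convenience, the same commands can be wrapped in a small script so the whole set is re-applied in one go after a reboot. This is just a sketch around the exact rocm-smi invocations above; nothing beyond them is assumed:
#!/bin/bash
# Re-apply the "recommended" overdrive settings listed above in one shot.
# Sketch only: assumes a single GPU and that overdrive is already enabled.
set -e
rocm-smi --setfan 60%
rocm-smi --autorespond y --setmlevel 2 1100 903
rocm-smi --autorespond y --setmlevel 3 1100 905
# SCLK DPM levels 2..7: 992..1000 MHz at 901..906 mV
level=2
for pair in "992 901" "993 902" "994 903" "995 904" "996 905" "1000 906"; do
    rocm-smi --autorespond y --setslevel "$level" $pair   # $pair left unquoted on purpose: it expands to "<MHz> <mV>"
    level=$((level + 1))
done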
With these settings my benchmark (ethminer) initially ran fine at 41.5 MH/s, but the card was quickly throttled down to 37 MH/s. rocm-smi showed:
======================== ROCm System Management Interface ========================
================================================================================================
GPU  Temp  AvgPwr  SCLK     MCLK     PCLK          Fan     Perf    PwrCap  SCLK OD  MCLK OD  GPU%
0    69c   207.0W  1000Mhz  1100Mhz  8.0GT/s, x16  66.67%  manual  220W    26806%   16%      100%
================================================================================================
======================== End of ROCm SMI Log ========================
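To watch the throttling happen over time, it is enough to log the plain rocm-smi table in a loop while the benchmark runs. A rough sketch (smi_log.txt is just an arbitrary file name):
# Print the rocm-smi summary once per second with a timestamp, so Temp/AvgPwr/SCLK
# can be correlated with the hashrate drop.
while true; do
    date '+%H:%M:%S'
    rocm-smi
    sleep 1
done | tee -a smi_log.txt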
Earlier, by mistake, I had tried different settings (not following the recommendations), which turn out to work much better though:
rocm-smi --setfan 60%
rocm-smi --autorespond y --setslevel 5 1000 906
rocm-smi --autorespond y --setslevel 6 1000 906
rocm-smi --autorespond y --setslevel 7 1000 906
rocm-smi --autorespond y --setmlevel 3 1100 950
With the above settings the card runs cooler and at lower power, holding a stable 41.5 MH/s. rocm-smi shows:
======================== ROCm System Management Interface ========================
================================================================================================
GPU  Temp  AvgPwr  SCLK     MCLK     PCLK          Fan     Perf    PwrCap  SCLK OD  MCLK OD  GPU%
0    43c   141.0W  1000Mhz  1100Mhz  8.0GT/s, x16  66.67%  manual  220W    26806%   16%      99%
================================================================================================
======================== End of ROCm SMI Log ========================
The only difference is AvgPwr, which dropped from 207W to 141W; as a result the card runs cooler and performance is not thermally throttled.
I have verified that the contents of /sys/class/drm/card0/device/pp_od_clk_voltage match what my rocm-smi commands requested in both cases (luck_settings.txt vs bad_settings.txt).
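Capturing and comparing the two snapshots needs nothing more than the following (a sketch; the file names are the ones mentioned above):
# Dump the overdrive table after applying each set of commands, then diff them.
cat /sys/class/drm/card0/device/pp_od_clk_voltage > bad_settings.txt    # after the "recommended" commands
cat /sys/class/drm/card0/device/pp_od_clk_voltage > luck_settings.txt   # after the "lucky" commands
diff bad_settings.txt luck_settings.txt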
So the question is: why do the supposedly incorrect settings work so much better than the recommended ones, even though the GPU/memory clocks appear to be the same?
P.S. Another interesting thing: why does the fan run at 66% instead of the 60% the command should have set it to?
For the fan, it was an issue with the SMI taking values at different times and giving different results. That part should be fine in 2.2 now that I handled a race condition where the value and the percent weren't matching up, giving skewed percentages.
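In the meantime, the SMI value can be cross-checked against the raw amdgpu hwmon attributes directly. A rough sketch (card0 is assumed, and the hwmonX directory name varies per system, hence the glob):
# Read the raw fan PWM from sysfs and compute the percentage independently of the SMI.
hwmon_dir=$(ls -d /sys/class/drm/card0/device/hwmon/hwmon* | head -n 1)
pwm=$(cat "$hwmon_dir/pwm1")          # current fan PWM, 0-255
pwm_max=$(cat "$hwmon_dir/pwm1_max")  # usually 255
awk -v p="$pwm" -v m="$pwm_max" 'BEGIN { printf "fan: %d/%d = %.2f%%\n", p, m, 100 * p / m }'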
As for the incorrect settings, I am unsure; it could just be a quirk of the GPU itself in terms of how it handles the voltages and what the "optimal" values are. Maybe @jgreathouse has an idea.
IIRC, I've observed the same behavior but I have not had the bandwidth to root cause it. Hunting down power idiosyncrasies takes a tremendous amount of work since it crosses half a dozen layers of software and firmware.
I suspect this is a firmware issue, because I remember similarly strange results from my ill-fated Windows attempts a year ago. Given that the difference is more than 25%, it should be worth it for AMD to invest the resources and figure out what is misbehaving. It would be a huge performance/power advantage for AMD if its cards started working more efficiently just by fixing some power-profile selection algorithm or the like.
It's been a while since I tested this, but IIRC I was only able to reproduce the problem when setting custom powerplay tables rather than using the tables we ship in the GPU's VBIOS. This significantly reduces the value proposition, since the vast majority of normal users aren't setting custom powerplay tables.
Do you mean changing the powerplay tables in the driver? I didn't change them; I only ran the commands in the description. Given that different use cases can't all be covered well by the default power/clock settings, letting users tune them easily would be a huge advantage for AMD. I think this would be used a lot for compute workloads.
The 25% difference was only between the "properly" optimized settings and the seemingly improper settings that actually work great. If I run with the vanilla default settings (no clock tuning at all), the card tops out at 250W with lower performance. So tuning the power/clock settings can improve the overall power/performance characteristics by around 50% (141W vs 250W alone is already a ~44% power reduction, at equal or better hashrate).
IMHO that's worth it. On the other hand, I trust that AMD knows what its priorities should be.
Regards.