Incoherent power settings and behaviour of Vega Frontier Edition
Hi, I was playing with overdrive settings on my Vega Frontier Edition with rocm-dkms-2.0.89-1.x86_64.
I enabled overdrive and used rocm-smi to apply some settings, and by pure luck I'm observing some very strange behaviour. First of all, I'm aware of https://github.com/RadeonOpenCompute/ROCm/issues/463#issuecomment-450698247 and https://github.com/RadeonOpenCompute/ROCm/issues/564#issuecomment-428069668, and these are the commands I prepared based on those recommendations:
rocm-smi --setfan 60%
rocm-smi --autorespond y --setmlevel 2 1100 903
rocm-smi --autorespond y --setmlevel 3 1100 905
rocm-smi --autorespond y --setslevel 2 992 901
rocm-smi --autorespond y --setslevel 3 993 902
rocm-smi --autorespond y --setslevel 4 994 903
rocm-smi --autorespond y --setslevel 5 995 904
rocm-smi --autorespond y --setslevel 6 996 905
rocm-smi --autorespond y --setslevel 7 1000 906
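For convenience, the same commands can be wrapped in a small script so the whole set is re-applied in one go after a reboot. This is just a sketch around the exact rocm-smi invocations above; nothing beyond them is assumed:
#!/bin/bash
# Re-apply the "recommended" overdrive settings listed above in one shot.
# Sketch only: assumes a single GPU and that overdrive is already enabled.
set -e
rocm-smi --setfan 60%
rocm-smi --autorespond y --setmlevel 2 1100 903
rocm-smi --autorespond y --setmlevel 3 1100 905
# SCLK DPM levels 2..7: 992..1000 MHz at 901..906 mV
level=2
for pair in "992 901" "993 902" "994 903" "995 904" "996 905" "1000 906"; do
    rocm-smi --autorespond y --setslevel "$level" $pair   # $pair left unquoted on purpose: it expands to "<MHz> <mV>"
    level=$((level + 1))
done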
With these settings my benchmark (ethminer) initially ran fine at 41.5 MH/s, but the card was quickly throttled down to 37 MH/s. rocm-smi showed:
======================== ROCm System Management Interface ========================
================================================================================================
GPU  Temp  AvgPwr  SCLK     MCLK     PCLK          Fan     Perf    PwrCap  SCLK OD  MCLK OD  GPU%
0    69c   207.0W  1000Mhz  1100Mhz  8.0GT/s, x16  66.67%  manual  220W    26806%   16%      100%
================================================================================================
======================== End of ROCm SMI Log ========================
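To watch the throttling happen over time, it is enough to log the plain rocm-smi table in a loop while the benchmark runs. A rough sketch (smi_log.txt is just an arbitrary file name):
# Print the rocm-smi summary once per second with a timestamp, so Temp/AvgPwr/SCLK
# can be correlated with the hashrate drop.
while true; do
    date '+%H:%M:%S'
    rocm-smi
    sleep 1
done | tee -a smi_log.txt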
Earlier, by mistake, I had tried different settings (not following the recommendations), which turn out to work much better though:
rocm-smi --setfan 60%
rocm-smi --autorespond y --setslevel 5 1000 906
rocm-smi --autorespond y --setslevel 6 1000 906
rocm-smi --autorespond y --setslevel 7 1000 906
rocm-smi --autorespond y --setmlevel 3 1100 950
With the above settings the card runs cooler and at lower power, holding a stable 41.5 MH/s. rocm-smi shows:
======================== ROCm System Management Interface ========================
================================================================================================
GPU  Temp  AvgPwr  SCLK     MCLK     PCLK          Fan     Perf    PwrCap  SCLK OD  MCLK OD  GPU%
0    43c   141.0W  1000Mhz  1100Mhz  8.0GT/s, x16  66.67%  manual  220W    26806%   16%      99%
================================================================================================
======================== End of ROCm SMI Log ========================
The only difference is AvgPwr, which dropped from 207W to 141W; as a result the card runs cooler and performance is not thermally throttled.
I have verified that the contents of /sys/class/drm/card0/device/pp_od_clk_voltage match what my rocm-smi commands requested in both cases (luck_settings.txt vs bad_settings.txt).
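Capturing and comparing the two snapshots needs nothing more than the following (a sketch; the file names are the ones mentioned above):
# Dump the overdrive table after applying each set of commands, then diff them.
cat /sys/class/drm/card0/device/pp_od_clk_voltage > bad_settings.txt    # after the "recommended" commands
cat /sys/class/drm/card0/device/pp_od_clk_voltage > luck_settings.txt   # after the "lucky" commands
diff bad_settings.txt luck_settings.txt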
So the question is: why do the supposedly incorrect settings work so much better than the recommended ones, even though the GPU/memory clocks appear to be the same?
P.S. Another interesting thing: why does the fan run at 66% instead of the 60% the command should have set it to?
For the fan, it was an issue with the SMI taking values at different times and giving different results. That part should be fine in 2.2 now that I handled a race condition where the value and the percent weren't matching up, giving skewed percentages.
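In the meantime, the SMI value can be cross-checked against the raw amdgpu hwmon attributes directly. A rough sketch (card0 is assumed, and the hwmonX directory name varies per system, hence the glob):
# Read the raw fan PWM from sysfs and compute the percentage independently of the SMI.
hwmon_dir=$(ls -d /sys/class/drm/card0/device/hwmon/hwmon* | head -n 1)
pwm=$(cat "$hwmon_dir/pwm1")          # current fan PWM, 0-255
pwm_max=$(cat "$hwmon_dir/pwm1_max")  # usually 255
awk -v p="$pwm" -v m="$pwm_max" 'BEGIN { printf "fan: %d/%d = %.2f%%\n", p, m, 100 * p / m }'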
As for the incorrect settings, I am unsure; it could just be a quirk of the GPU itself in terms of how it handles the voltages and what the "optimal" values are. Maybe @jgreathouse has an idea.
IIRC, I've observed the same behavior but I have not had the bandwidth to root cause it. Hunting down power idiosyncrasies takes a tremendous amount of work since it crosses half a dozen layers of software and firmware.
I suspect this is a firmware issue, because I remember similarly strange results from my ill-fated Windows attempts a year ago. Given that the difference is more than 25%, it should be worth it for AMD to invest the resources and figure out what is misbehaving. It would be a huge performance/power advantage for AMD if its cards started working more efficiently just by fixing some power-profile selection algorithm or the like.
It's been a while since I tested this, but IIRC I was only able to reproduce the problem when setting custom powerplay tables rather than using the tables we ship in the GPU's VBIOS. This significantly reduces the value proposition, since the vast majority of normal users aren't setting custom powerplay tables.
Do you mean changing the powerplay tables in the driver? I didn't change them; I only ran the commands in the description. Given that different use cases can't all be covered well by the default power/clock settings, letting users tune them easily would be a huge advantage for AMD. I think this would be used a lot for compute workloads.
The 25% difference was only between the "properly" optimized settings and the seemingly improper settings that actually work great. If I run with the vanilla default settings (no clock tuning at all), the card tops out at 250W with lower performance. So tuning the power/clock settings can improve the overall power/performance characteristics by around 50% (141W vs 250W alone is already a ~44% power reduction, at equal or better hashrate).
IMHO that's worth it. On the other hand, I trust that AMD knows what its priorities should be.
Regards.