Making relief mapping faster
Relief mapping is slow. Here are some framerates recorded on my system for each of our presets, with relief mapping (RM) disabled and enabled. They were recorded at 4K screen resolution on an AMD Radeon PRO W7600.
| preset | RM off | RM on |
|---|---|---|
| lowest | 1250 fps | 780 fps |
| low | 1100 fps | 650 fps |
| medium | 600 fps | 330 fps |
| high | 580 fps | 280 fps |
| ultra | 580 fps | 290 fps |
This single feature costs roughly 450 to 470 fps on the lowest and low presets, and roughly 270 to 300 fps on the medium, high and ultra presets.
Don't ask me why I get 10 fps more with the ultra preset than with the high preset when relief mapping is enabled: I tested multiple times and reproduced it. 🤔️
Anyway, the root problem addressed by this issue is the slowness of relief mapping.
The big performance drop from the lowest and low presets to the medium, high and ultra presets comes from enabling multitexturing with normal mapping, specular mapping, etc., which adds a lot of shader code and binds many more textures. We may also lose performance in the way we switch between shaders (something the material branch attempts to fix).
But relief mapping single-handedly costs about as much performance as all the features enabled in the medium preset.
Among the possible improvements we may investigate, @SomaZ said this in the developer chat channel:
> texture2d is usually slow in loops. using textureGrad can improve performance as it doesn't need to determine mips again and again

Here is some Khronos documentation about textureGrad:
- https://registry.khronos.org/OpenGL-Refpages/gl4/html/textureGrad.xhtml
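
To make the suggestion concrete, here is a minimal sketch of sampling a heightmap in a loop with `textureGrad`, with the derivatives computed once before the loop. It assumes GLSL 1.30 or later, and `u_HeightMap`, `var_TexCoords` and `var_RayStep` are hypothetical names used for illustration, not the engine's actual identifiers. Whether this is faster than implicit-LOD sampling depends on the hardware and driver.

```glsl
#version 130

// Hypothetical names, for illustration only.
uniform sampler2D u_HeightMap;
in vec2 var_TexCoords;
in vec2 var_RayStep;   // assumed texture-coordinate offset per loop step
out vec4 outColor;

void main()
{
    // Take the screen-space derivatives once, before any loop or branch.
    vec2 dx = dFdx(var_TexCoords);
    vec2 dy = dFdy(var_TexCoords);

    float sum = 0.0;
    for (int i = 0; i < 16; i++)
    {
        vec2 uv = var_TexCoords + var_RayStep * float(i);
        // texture(u_HeightMap, uv) would select the mip level from implicit
        // derivatives of uv; textureGrad uses the explicit ones instead.
        sum += textureGrad(u_HeightMap, uv, dx, dy).r;
    }

    // Output the average sampled height, just to use the result.
    outColor = vec4(vec3(sum / 16.0), 1.0);
}
```

Because the same `dx` and `dy` are reused for every sample, all 16 fetches resolve to the mip level the unperturbed texture coordinates would select.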
I forgot to say that the scene I used for the benchmark was the plat23 default spectator scene.
When running the game at 4K with relief mapping enabled, the game CPU usage is only 32%, so the game is waiting on the GPU.
There's probably also a lot of lane divergence and texture cache thrashing with the current implementation. If I understood it correctly, it's looping over the heightmap 16 times in the direction of view origin to find some sort of depth value, then 6 more times around that point to find the closest value in the heightmap or something like that. The first loop is also missing a break;. Shader compilers might be optimising out the other iterations that do nothing, but they also might not.
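
For reference, the two-phase search described above could be sketched as follows, with the early `break` in place. This is not the engine's shader: the names, the depth convention (0.0 at the top surface, 1.0 at the deepest point) and the view-direction sign are assumptions made for illustration. The sketch samples with `textureGrad` because implicit derivatives are undefined once fragments have left the loop at different iterations.

```glsl
#version 130

// All names below are hypothetical placeholders, not the engine's identifiers.
uniform sampler2D u_HeightMap;  // assumed to store depth: 0.0 = top, 1.0 = deepest
uniform float u_DepthScale;     // assumed relief depth in texture-coordinate units
in vec2 var_TexCoords;
in vec3 var_ViewDir;            // assumed tangent-space direction from surface to eye
out vec4 outColor;

const int LINEAR_STEPS = 16;
const int BINARY_STEPS = 6;

float findDepth(vec2 uv, vec2 ds, vec2 dx, vec2 dy)
{
    float stepSize = 1.0 / float(LINEAR_STEPS);
    float depth = 0.0;

    // Phase 1: march along the ray until it is at or below the height field.
    for (int i = 0; i < LINEAR_STEPS; i++)
    {
        depth += stepSize;
        float surface = textureGrad(u_HeightMap, uv + ds * depth, dx, dy).r;
        if (depth >= surface)
            break; // first hit found, the remaining steps would do no useful work
    }

    // Phase 2: bisect the last interval to refine the intersection depth.
    for (int i = 0; i < BINARY_STEPS; i++)
    {
        stepSize *= 0.5;
        float surface = textureGrad(u_HeightMap, uv + ds * depth, dx, dy).r;
        depth += (depth >= surface) ? -stepSize : stepSize;
    }
    return depth;
}

void main()
{
    vec2 dx = dFdx(var_TexCoords);
    vec2 dy = dFdy(var_TexCoords);
    vec3 v = normalize(var_ViewDir);
    // Texture-coordinate offset covered when the ray descends the full depth range.
    vec2 ds = -v.xy / max(v.z, 0.001) * u_DepthScale;
    float depth = findDepth(var_TexCoords, ds, dx, dy);
    outColor = vec4(vec3(depth), 1.0); // visualise the depth that was found
}
```

Note that the early exit only saves time when neighbouring fragments leave the loop at similar iterations; otherwise the whole group still waits for its slowest lane.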
Surprisingly, replacing texture2D with textureGrad with constant (for the fragment) derivatives obtained once from dFdx and dFdy actually made performance a lot worse, though I could've made an error somewhere. Maybe this results in a lot of texels being accessed at a lower mip-level?
Is this fixed by 83244ed8442530e2776d4297b66dbb7a1ceacda6? It was claimed to be worth "hundreds of FPS" in the 0.55 release post.
In some way, yes. But the initial issue was not about a bug hurting performance; it was about the possibility of rewriting the code using faster functions. @VReaperV tried that and said:
> Surprisingly, replacing `texture2D` with `textureGrad` with constant (for the fragment) derivatives obtained once from `dFdx` and `dFdy` actually made performance a lot worse, though I could've made an error somewhere. Maybe this results in a lot of texels being accessed at a lower mip-level?

I wonder if there was an error or not in that attempt…