Daemon: Making relief mapping faster

Relief mapping is slow. Here are some framerates recorded on my system with each of our presets, with relief mapping (RM) disabled and enabled. They were recorded at 4K screen resolution on an AMD Radeon PRO W7600.

preset	RM off	RM on
lowest	1250 fps	780 fps
low	1100 fps	650 fps
medium	600 fps	330 fps
high	580 fps	280 fps
ultra	580 fps	290 fps

This single feature removes roughly 450 to 470 fps from the framerate of the lowest and low presets, and roughly 270 to 300 fps from the framerate of the medium, high and ultra presets.
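As a rough cross-check, the same numbers can be read as frame times instead of framerates. The sketch below only does arithmetic on the figures from the table; the function names are made up for illustration.

```python
# Frame rates (fps) copied from the table above: preset -> (RM off, RM on).
fps = {
    "lowest": (1250, 780),
    "low": (1100, 650),
    "medium": (600, 330),
    "high": (580, 280),
    "ultra": (580, 290),
}

def frame_time_ms(rate_fps):
    """Convert a frame rate in fps to a frame time in milliseconds."""
    return 1000.0 / rate_fps

def relief_mapping_cost(rm_off_fps, rm_on_fps):
    """Return (fps lost, extra milliseconds per frame) when RM is enabled."""
    fps_lost = rm_off_fps - rm_on_fps
    extra_ms = frame_time_ms(rm_on_fps) - frame_time_ms(rm_off_fps)
    return fps_lost, extra_ms

for preset, (off, on) in fps.items():
    lost, extra = relief_mapping_cost(off, on)
    print(f"{preset:7s} -{lost:4d} fps  +{extra:.2f} ms per frame")
```

Read this way, relief mapping adds roughly 0.5 to 0.6 ms per frame on lowest and low, and roughly 1.4 to 1.8 ms per frame on medium, high and ultra, so the smaller fps drop on the higher presets does not mean the feature is cheaper there.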

Don't ask me why I get 10 fps more with the ultra preset than with the high preset when relief mapping is enabled: I tested multiple times and reproduced it. 🤔️

Anyway, the root problem addressed by this issue is the slowness of relief mapping.

The big performance drop from the lowest and low presets to medium, high and ultra comes from enabling multitexturing with normal mapping, specular mapping, etc., which adds a lot of code and binds many more textures. We may also lose performance in the way we switch between shaders (something the material branch attempts to fix).

But relief mapping single-handedly costs as much performance as all the features enabled in the medium preset.

Among the possible improvements we may investigate, @SomaZ said this in the developer chat channel:

texture2d is usually slow in loops. using textureGrad can improve performance as it doesnt need to determine mips again and again

Here is some Khronos documentation about textureGrad:

https://registry.khronos.org/OpenGL-Refpages/gl4/html/textureGrad.xhtml

May 12 '24 16:05 illwieckz

I forgot to say that the scene I used for the benchmark was the plat23 default spectator scene.

When running the game at 4K with relief mapping enabled, the game's CPU usage is only 32%, so the game is waiting on the GPU.

May 12 '24 16:05 illwieckz

There's probably also a lot of lane divergence and texture cache thrashing with the current implementation. If I understood it correctly, it's looping over the heightmap 16 times in the direction of view origin to find some sort of depth value, then 6 more times around that point to find the closest value in the heightmap or something like that. The first loop is also missing a break;. Shader compilers might be optimising out the other iterations that do nothing, but they also might not.
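To make the two-loop structure concrete, here is a minimal CPU-side sketch of a linear search followed by a binary refinement, with the early exit in the first loop. It is not the engine's shader code: `sample_depth` is a hypothetical stand-in for the heightmap fetch, the 16 and 6 step counts are taken from the description above, and the exact refinement rule is an assumption.

```python
def relief_offset(sample_depth, linear_steps=16, binary_steps=6):
    """Find the ray parameter t in [0, 1] where the ray first dips below the surface.

    sample_depth(t) is a hypothetical stand-in for the heightmap fetch at the
    texture coordinate reached at ray parameter t; it returns a depth in [0, 1].
    The ray's own depth at parameter t is simply t.
    Returns (t, number of heightmap fetches performed).
    """
    step = 1.0 / linear_steps
    t = 0.0
    fetches = 0
    # Linear search: walk along the ray until it is at or below the surface.
    for _ in range(linear_steps):
        t += step
        fetches += 1
        if t >= sample_depth(t):
            break  # early exit: the remaining iterations would do nothing useful
    # Binary search: refine around the point found by the linear search.
    for _ in range(binary_steps):
        step *= 0.5
        fetches += 1
        if t >= sample_depth(t):
            t -= step  # at or below the surface: move back toward the viewer
        else:
            t += step  # above the surface: move further along the ray
    return t, fetches

# A flat surface at depth 0.5 is hit halfway through the linear search.
t, fetches = relief_offset(lambda _t: 0.5)
print(t, fetches)
```

With the early exit, this flat example does 8 + 6 = 14 fetches instead of 16 + 6 = 22. On a GPU the saving is less clear-cut, since fragments that run together only finish once the slowest of them has left the loop.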

Sep 27 '24 10:09 VReaperV

Surprisingly, replacing texture2D with textureGrad with constant (for the fragment) derivatives obtained once from dFdx and dFdy actually made performance a lot worse, though I could've made an error somewhere. Maybe this results in a lot of texels being accessed at a lower mip-level?
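For reference, here is a simplified sketch of how a mip level follows from a pair of UV derivatives, which is what passing derivatives to textureGrad controls. It assumes the usual isotropic rule (scale the derivatives to texels, take the longer footprint axis, take its log2) and ignores LOD bias, anisotropic filtering and driver-specific approximations; `mip_level` is a made-up helper, not an OpenGL call.

```python
import math

def mip_level(duv_dx, duv_dy, tex_width, tex_height, max_level):
    """Approximate the mip level selected from screen-space UV derivatives.

    duv_dx and duv_dy are (du, dv) pairs, the kind of values dFdx and dFdy
    would return for the texture coordinate. Simplified isotropic rule only.
    """
    # Length of each footprint axis, measured in texels of the base level.
    dx = math.hypot(duv_dx[0] * tex_width, duv_dx[1] * tex_height)
    dy = math.hypot(duv_dy[0] * tex_width, duv_dy[1] * tex_height)
    rho = max(dx, dy)
    if rho <= 1.0:
        return 0.0  # magnification: the base level is used
    return min(math.log2(rho), float(max_level))

# One screen pixel covering 4 texels of a 1024x1024 texture selects level 2.
print(mip_level((1 / 256, 0.0), (0.0, 1 / 256), 1024, 1024, 10))

# Derivatives four times smaller select the base level instead.
print(mip_level((1 / 1024, 0.0), (0.0, 1 / 1024), 1024, 1024, 10))
```

Under this rule, derivatives computed once before the loop give the same level for every fetch in the loop. If those derivatives are smaller than the ones the hardware would derive implicitly from the offset coordinates inside the loop, a lower (more detailed) level is selected, which would be consistent with the guess above, though this sketch cannot confirm it.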

Sep 29 '24 08:09 VReaperV

Is this fixed by 83244ed8442530e2776d4297b66dbb7a1ceacda6? It was claimed to be worth "hundreds of FPS" in the 0.55 release post.

Oct 23 '24 20:10 slipher

In a way, yes, but the initial issue was not about a bug hurting performance. It was about the possibility of rewriting the code using faster functions, which @VReaperV tried and reported:

Surprisingly, replacing texture2D with textureGrad with constant (for the fragment) derivatives obtained once from dFdx and dFdy actually made performance a lot worse, though I could've made an error somewhere. Maybe this results in a lot of texels being accessed at a lower mip-level?

I wonder if there was an error or not in that attempt…

Oct 23 '24 20:10 illwieckz