Alternate lazy shader loading implementation
Right now we have three GLSL shader loading modes, selected with r_lazyShaders:
- 0: build all shaders at game start, before reaching the main menu (old default). Everything is built before anything useful is displayed, the user doesn't even know what is happening, and the main menu may be unreachable if an advanced shader fails to build, even when the main menu doesn't use it.
- 1: build shaders on demand until a map loads, then build everything else at map load (current default). Only the shaders required to display the main menu are built before it is shown, which makes game start and preset selection in the main menu fast, and makes it possible to select a lower preset in the main menu if a shader would fail later at map load.
- 2: build all shaders on demand. No unused shader is ever built, but a shader may be built while playing, the first time a texture requiring it is displayed, which produces stuttering.
Option 1 is a good enough trade-off, but we still build many shaders for nothing. For example, we will build the relief mapping and the deluxe mapping permutations even when loading a Tremulous map that uses neither feature.
With option 1 the engine will also build every shader permutation for unused features. For example, if we merge #258, which makes it possible to generate normal maps from height maps at render time when normal maps are missing, that feature and all its permutations (with or without deluxe map, with lightmap, with grid lighting…) will be built even if no asset in the currently played map is missing a normal map.
With 0.55, the engine actually selects the rendering function at q3shader load time, meaning we may be able to just implement a loop that would render all the textures on some unseen framebuffer at map load time, triggering the GLSL shader build. That means such lazy shader loading would exhaustively build all required GLSL shaders at map load time, without missing used ones, and without building unused permutations.
This would allow us to worry less about adding more GLSL shader permutations, because the number of permutations grows exponentially with the number of features (roughly 2^N for N independent on/off macros). Right now our only way to avoid building crazy amounts of permutations is knowing that some combinations can't happen and skipping them, but we could simply skip all permutations that are not used for rendering the current map.
Actually the deluxe/lightmap permutation usage is only known at run time, so the suggestion would not fully work. But we may force the building of only a couple of them to avoid that.
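To make the difference concrete, here is a minimal sketch comparing exhaustive building with building only what the map's surfaces request. The macro names, the bitmask representation and both helper functions are hypothetical and are not taken from the engine; it only assumes that each surface can report the set of macros it needs.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical feature bits, for illustration only; the engine's real macros differ.
constexpr uint32_t USE_DELUXE_MAPPING  = 1u << 0;
constexpr uint32_t USE_RELIEF_MAPPING  = 1u << 1;
constexpr uint32_t USE_VERTEX_SKINNING = 1u << 2;
constexpr uint32_t USE_GRID_LIGHTING   = 1u << 3;
constexpr uint32_t kMacroCount = 4;

// Exhaustive building: every combination of N independent on/off macros.
inline uint64_t TotalPermutations(uint32_t macroCount) {
    return uint64_t{1} << macroCount;
}

// Lazy building: only the distinct combinations that surfaces actually request.
inline std::set<uint32_t> UsedPermutations(const std::vector<uint32_t>& surfaceMacros) {
    return std::set<uint32_t>(surfaceMacros.begin(), surfaceMacros.end());
}
```

With four macros the exhaustive approach builds 16 permutations, while a map whose surfaces only ever request three distinct combinations would build three.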
Another implementation for such lazy loading would just be to render the whole world and all models to some unseen framebuffer in a frame at the end of map load.
That would get you fairly close, I suppose. There are probably still some ways that additional q3shaders can be loaded, though, which could trigger more GLSL builds. Like maybe that map entity shader remap thing.
Interesting.
Another example of something I know to be loaded in game is some UI textures, when some UI is rendered for the first time, but this is the kind of thing for which we can force the build just to be sure.
When using /glsl_restart, which currently behaves like r_lazyShaders 2, I notice that it can restart very quickly. So it does seem that on a typical map we are not using the vast majority of shaders. We might get some big wins in loading time here.
Maybe the cgame should be tasked with rendering all the possible things. For example, rendering models with all possible states (e.g. powered vs. unpowered), since those can have different shaders. To handle shader remaps, the world could be rendered multiple times with each remap activated and deactivated. But you'd have to render not just the main BSP, but also any sub-models and embedded md3/iqm/etc. models. This would require a lot of knowledge about BSP entities to be added to the cgame.
Models and all that are already loaded by the first frame in https://github.com/Unvanquished/Unvanquished/blob/cc472ba717bd0e2a9e4d2e19f4daab25d154a6c4/src/cgame/cg_main.cpp#L1317-L1361
Not sure as to the BSP entities though. Engine should have enough information to build all the relevant shaders for them though.
Another implementation for such lazy loading would just be to render the whole world and all models to some unseen framebuffer in a frame at the end of map load.
That would get you fairly close I suppose. There's probably still some ways that additional q3shaders can be loaded though, which could trigger more GLSL. Like maybe that map entity shader remap thing
The remap shaders are always loaded with the main shader.
Another example of thing I know to be loaded in game is some UI textures when some UI is rendered for the first time, but this one is the kind of thing we can force the build just to be sure.
This is no longer true after removing the generic2D shader, now UI only uses 1 simple shader.
When using /glsl_restart, which currently behaves like r_lazyShaders 2, I notice that it can restart very quickly. So it does seem that on a typical map we are not using the vast majority of shaders. We might get some big wins in loading time here.
Maybe the cgame should be tasked with rendering all the possible things. For example, rendering models with all possible states (e.g. powered vs. unpowered), since those can have different shaders. To handle shader remaps, the world could be rendered multiple times with each remap activated and deactivated. But you'd have to render not just the main BSP, but also any sub-models and embedded md3/iqm/etc. models. This would require a lot of knowledge about BSP entities to be added to the cgame.
All of this should work once the end registration function is called, unless cg_lazyLoadModels is enabled.
Another issue with our shader building is that we compile shaders with macros that don't affect them, e. g. we compile lightMapping_fp with vertex skinning/vertex animation, even though they have no effect there. The compilation doesn't take too long, most of the time is spent on linking, but it can still add up. There are also some permutations that we build that result in the same binary.
The link part is the bigger issue, as it takes the vast majority of the time spent building shaders (at least on Nvidia), and we end up linking e.g. the same lightMapping_fp multiple times with macros it does not use. This can be solved by using ARB_separate_shader_objects, which is core since OpenGL 4.1. I've got an in-progress branch that already does most of this (skips more unused macros, only compiles each variant once, and fixes some more issues).
Also note that for testing this, one might want to disable shader cache in the driver.
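The "skip unused macros, only compile each variant once" part can be sketched as masking each requested macro set with the macros a given stage actually reads, then deduplicating. This is only an illustration: the macro bits, the `FRAGMENT_RELEVANT` mask and both functions are hypothetical, and it does not show the separate shader objects part.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical macro bits, for illustration only.
constexpr uint32_t MACRO_VERTEX_SKINNING  = 1u << 0;
constexpr uint32_t MACRO_VERTEX_ANIMATION = 1u << 1;
constexpr uint32_t MACRO_DELUXE_MAPPING   = 1u << 2;
constexpr uint32_t MACRO_RELIEF_MAPPING   = 1u << 3;

// Assumption: this fragment stage only reads these two macros.
constexpr uint32_t FRAGMENT_RELEVANT = MACRO_DELUXE_MAPPING | MACRO_RELIEF_MAPPING;

// Drop the macros that have no effect on the stage being compiled.
inline uint32_t EffectiveMacros(uint32_t requested, uint32_t relevant) {
    return requested & relevant;
}

// Collect the distinct variants that really need to be compiled for one stage.
inline std::set<uint32_t> UniqueVariants(const std::vector<uint32_t>& requested,
                                         uint32_t relevant) {
    std::set<uint32_t> variants;
    for (uint32_t macros : requested) {
        variants.insert(EffectiveMacros(macros, relevant));
    }
    return variants;
}
```

Four requested permutations that only differ by vertex skinning or vertex animation collapse to the fragment variants that actually differ.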
Something I thought about for the shader cache is that instead of actually tracking all the different shaders we cache, and remembering what they are cached for, we would just write them using the checksum of their generated source as name. To prevent the accumulation of thousands of old shaders, we would have a maximum number of shaders in cache and delete the oldest cached shaders when building new ones, so that the cache never holds more than that maximum. With some margin between the actual maximum number of shaders we use and the maximum number of shaders kept in cache, it would also help when a non-latched cvar triggers a rebuild: the same shader with a different generated source could be reread from cache when toggling the cvar back, something we can't do right now.
Something I thought about for the shader cache is that instead of actually tracking all the different shader we cache, and remember for what they are cached for, we would just write them using the checksum of their generated source as name.
I have sort of a similar idea: have a file that stores key:value pairs for each saved program binary, where the key is the alphabetically sorted list of each shader name, macro (using a unique id for each macro instead of a shader-based one, a single number), and type (vertex, fragment, compute, just a single number), and the value would be a checksum or something. Then when the renderer is started, this file would be read into a hashmap, which would then be used for lookup.
To prevent the accumulation of thousands of old shaders, we would have a max number of shaders in cache and delete the older cached shaders when building new ones in order to not have more than the max of cached shaders in cache.
Yeah, we could have some sort of LRU there.
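A bounded cache keyed by the checksum of the generated source, with LRU eviction, could look roughly like the sketch below. It is in-memory only and purely illustrative: `BinaryCache` and `SourceChecksum` are hypothetical names, and FNV-1a is used as a stand-in checksum, not whatever the engine or the branch actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in checksum (64-bit FNV-1a) over the fully generated shader source.
inline uint64_t SourceChecksum(const std::string& source) {
    uint64_t hash = 0xcbf29ce484222325ULL;
    for (unsigned char c : source) {
        hash ^= c;
        hash *= 0x100000001b3ULL;
    }
    return hash;
}

// Hypothetical bounded cache: the most recently used entry sits at the front.
class BinaryCache {
    using Entry = std::pair<uint64_t, std::vector<uint8_t>>;

public:
    explicit BinaryCache(std::size_t maxEntries) : maxEntries_(maxEntries) {}

    void Store(uint64_t key, std::vector<uint8_t> binary) {
        auto it = index_.find(key);
        if (it != index_.end()) {  // replace an existing entry
            entries_.erase(it->second);
            index_.erase(it);
        }
        entries_.emplace_front(key, std::move(binary));
        index_[key] = entries_.begin();
        if (entries_.size() > maxEntries_) {  // evict the least recently used
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
    }

    // Returns nullptr on a miss; a hit marks the entry as most recently used.
    const std::vector<uint8_t>* Find(uint64_t key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        entries_.splice(entries_.begin(), entries_, it->second);
        return &it->second->second;
    }

    std::size_t Size() const { return entries_.size(); }

private:
    std::size_t maxEntries_;
    std::list<Entry> entries_;
    std::unordered_map<uint64_t, std::list<Entry>::iterator> index_;
};
```

Because the key depends only on the generated source, toggling a cvar back to a previous value produces the same key again, so the earlier binary can be found as long as it has not been evicted.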
With some margin between the actual maximum number of shaders we use and the maximum number of shaders kept in cache, it would also help when a non-latched cvar triggers a rebuild: the same shader with a different generated source could be reread from cache when toggling the cvar back, something we can't do right now.
Not sure what you mean by the latter part here. If the source is different, shouldn't it always be rebuilt?
Having shaders cached by name + macro is not a bad idea, but it does fail with deformVertexes, because it depends on the specific q3 shader. We could, of course, have directories for each map for shader programs with non-empty deforms, but I don't wanna special-case it (I've already made the shader look-up/processing generic between regular and deform shaders on my branch, and I wanna keep it that way for saving/loading too).
Something I thought about for the shader cache is that instead of actually tracking all the different shader we cache, and remember for what they are cached for, we would just write them using the checksum of their generated source as name.
Also, on my branch the checksum takes the whole shader program into account, with #insert, material post-processing, macros etc.
When using /glsl_restart, which currently behaves like r_lazyShaders 2, I notice that it can restart very quickly. So it does seem that on a typical map we are not using the vast majority of shaders. We might get some big wins in loading time here.
Maybe the cgame should be tasked with rendering all the possible things. For example, rendering models with all possible states (e.g. powered vs. unpowered), since those can have different shaders. To handle shader remaps, the world could be rendered multiple times with each remap activated and deactivated. But you'd have to render not just the main BSP, but also any sub-models and embedded md3/iqm/etc. models. This would require a lot of knowledge about BSP entities to be added to the cgame.
All of this should work once the end registration function is called, unless cg_lazyLoadModels is enabled.
One thing to note here is that dummy gamelogic doesn't (and supposedly can't?) call RE_EndRegistration().
So right now shader program binaries go into glsl/mainShaderName/shader0_macro0_shader1_macro1..., where mainShader is generic, lightMapping etc, and shaderx_macrox are the names of secondary shaders (i. e. they don't have void main()) compiled into the program, sorted alphabetically.
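As an illustration of that naming scheme, the path could be assembled as in the sketch below. `ProgramBinaryPath` is a hypothetical helper, and it assumes each macro set is written as a plain number and that `_` is the only separator; the exact on-disk format may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One secondary shader compiled into the program: (shader name, macro id).
using Part = std::pair<std::string, unsigned>;

// Builds glsl/<mainShader>/<shader0>_<macro0>_<shader1>_<macro1>...
// with the secondary shaders sorted alphabetically by name.
inline std::string ProgramBinaryPath(const std::string& mainShader,
                                     std::vector<Part> parts) {
    std::sort(parts.begin(), parts.end());
    std::string path = "glsl/" + mainShader + "/";
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) {
            path += '_';
        }
        path += parts[i].first + "_" + std::to_string(parts[i].second);
    }
    return path;
}
```

Sorting before joining means the same set of secondary shaders always maps to the same file name, whatever order they were registered in.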
One thing to note here is that dummy gamelogic doesn't (and supposedly can't?) call RE_EndRegistration().
One way to do that would be to have a console command that would build the shaders.
What I'm thinking currently is to move some of the surface processing out of tr_bsp.cpp (this would also make it easier to add certain optimisations), determine the shader macro/deform index for each surface, and add the permutation to a queue if it's not already there. Then, when either RE_EndRegistration() or the console command is called, go through all of the loaded models and add their shader permutations, then build everything in the queue. Permutations could, of course, also be built while going over the surfaces instead, but I think a queue makes more sense for the purposes of profiling map loading, and the cgame displays the "compiling GLSL shaders" step separately anyway.
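A rough sketch of such a queue, assuming a permutation can be identified by a shader id, a macro bitmask and a deform index. `Permutation`, `BuildQueue` and the build callback are hypothetical names, not code from the branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical identity of one GLSL permutation.
struct Permutation {
    uint32_t shaderId;
    uint32_t macros;
    uint32_t deformIndex;

    bool operator<(const Permutation& other) const {
        return std::tie(shaderId, macros, deformIndex) <
               std::tie(other.shaderId, other.macros, other.deformIndex);
    }
};

class BuildQueue {
public:
    // Called while walking surfaces and models; returns true if newly queued.
    bool Enqueue(const Permutation& permutation) {
        if (!seen_.insert(permutation).second) {
            return false;  // already queued or already built
        }
        pending_.push_back(permutation);
        return true;
    }

    // Called from the end-of-registration hook or the console command.
    // Builds everything queued so far and returns how many were built.
    std::size_t Flush(const std::function<void(const Permutation&)>& build) {
        const std::size_t built = pending_.size();
        for (const Permutation& permutation : pending_) {
            build(permutation);
        }
        pending_.clear();
        return built;
    }

private:
    std::set<Permutation> seen_;
    std::vector<Permutation> pending_;
};
```

Keeping the build in a single `Flush` call gives one place to time, which matches the profiling argument above; permutations requested again later are ignored because they stay in the seen set.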
The remap shaders are always loaded with the main shader.
This is not entirely correct - apparently there's a worldspawn shader remap thing.
Another issue with our shader building is that we compile shaders with macros that don't affect them, e. g. we compile lightMapping_fp with vertex skinning/vertex animation, even though they have no effect there. The compilation doesn't take too long, most of the time is spent on linking, but it can still add up. There are also some permutations that we build that result in the same binary.
This is now fixed, except for the link part.
Done in #1656.