Huge performance drop on r300 with default Thunder and Vega scene
I'm trying to investigate a huge performance drop affecting some low-end cards.
I currently suspect:
- a bug in the driver itself.
- something stupid done on our side similar to the way our skybox code was iterating over GPU memory directly: https://github.com/DaemonEngine/Daemon/issues/849#issuecomment-1536053464
- a combination of the twos.
Here is the Thunder map, lowest preset, 640×480 resolution, default spectator scene (1024 -3456 448 -90 0), 3fps:
Here is the Vega map, lowest preset, 640×480 resolution, default spectator scene (1472 -1792 176 -180 0), 5fps:
This configuration is expected to achieve ~~20~30fps~~ 60fps.
The testbench is a very low-end config, one of my favorites for identifying bottlenecks:
- CPU: AMD Athlon 64 X2 4400+, dual core 2.3GHz.
- RAM: 8Go DDR2 800MHz , 4×2Go Hynix HYMP125U64CP8-S6, dual channel.
- GPU: ATi Radeon X550 HyperMemory, RV370 (r300, tiny ALU), 128Mo 64-bit DDR, PCIe x16 1.0.
- OS: Ubuntu 23.10 mantic, Linux 6.5.0.
- Driver: Mesa 23.2.1 (stock) or Mesa 24.0.7 (self built).
In the Thunder map, I suspect the bottleneck comes from rendering something in that room and/or the outside. It can be the electric arc effect on those pillars, and/or the lightning effect on the top of the outside of the building, as this room as huge windows on the outside. Leaving that room and losing sight with the outside brings back performance, leaving that room without losing sight with the outside keeps the performance low.
In the Vega map, I suspect the bottleneck comes from rendering that globe or something around that globe. From the 4 sides of the room, I recover performance by turning back to that globe.
I identified places were just turning around the view enables/disables the performance loss in both maps.
Thunder, -30 -1497 10 -20 0:
Thunder, -30 -1497 10 112 0:
Vega, 667 -1792 176 0 0:
Vega: 667 -1792 176 -180 0:
As you see, just turning around the head without moving from the place you are enables/disables the cause of the bottleneck.
The cause of the bottleneck is very hard to track down.
Orbit fails to show me the calltrace, only identifying a very costly ioctl without any identified caller. Once, Orbit managed to relate the ioctl to some render thread in the Bottom-Up view.
And gprofng also shows me something suspect with openal, but what if the i/o on the sound card is actually slowed down by the whole PCIe bus being on knees because of agressive io for rendering?
But gperftools seems to track it back to the r300 DRM driver through Mesa, more explicitely:
That's why I suspect something similar to what we had with the sky: that GPU has a slow PCIe 1.0 x16 bus, and if we process the memory GPU directly in our computation, this can be very bad.
As you see, just turning around the head without moving from the place you are enables/disables the cause of the bottleneck.
What if you move backwards from there? If perf drops once the globe is in view, then it is the issue, otherwise it's not.
Both the electric arc effect and the globe materials use some tcMod:
textures/thunder_lightning/horizontallightning
{
qer_editorImage textures/thunder_lightning_src/horizontallightning
surfaceparm trans
surfaceparm nolightmap
surfaceparm nomarks
surfaceparm nonsolid
{
map textures/thunder_lightning_src/horizontallightning
blendfunc add
tcmod scroll -8 -0
}
}
models/mapobjects/vega/globe/globe
{
surfaceparm nomarks
surfaceparm nonsolid
cull none
{
map models/mapobjects/vega/globe/globe
tcmod scroll .05 0
}
}
In the slow sky thread, SomaZ recommended to also look at tcMod:
Should probably also make sure any tcMods or else also don't read/write to tess. -- https://github.com/DaemonEngine/Daemon/issues/849#issuecomment-1536071971
tcMod uses a matrix uniform.
Commenting out the tcmod scroll line in the vega globe shader doesn't bring back the performance.
I also verified that doing stage->bundle[ 0 ].numTexMods = 0 in FinishStages also disables the scrolling but doesn't bring back the performance.
| viewpos | relation | fps |
|---|---|---|
| 511 -1792 176 -180 0 | globe in the back | 40 |
| 615 -1792 176 -180 0 | globe in the back | 30 |
| 773 -1792 176 -180 0 | globe in the back | 17 |
| 844 -1792 176 -180 0 | globe in the back | 14 |
| 893 -1792 176 -180 0 | inside globe | 11 |
| 934 -1792 176 -180 0 | inside globe | 10 |
| 997 -1792 176 -180 0 | inside globe | 10 |
| 1006 -1792 176 -180 0 | globe in front | 10 |
| 1074 -1792 176 -180 0 | globe in front | 10 |
| 1143 -1792 176 -180 0 | globe in front | 9 |
| 1288 -1792 176 -180 0 | globe in front | 6 |
| 1382 -1792 176 -180 0 | globe in front | 6 |
| 1621 -1792 176 -180 0 | globe in front | 4 |
One may also suspect the amount of tris, but vega was running properly in Tremulous.
| map | viewpos | performance |
|---|---|---|
| thunder | 1472 -1792 176 -180 0 | slow |
| thunder | 667 -1792 176 0 0 | slow |
| thunder | 667 -1792 176 -180 0 | fast |
| map | viewpos | performance |
|---|---|---|
| vega | 1024 -3456 448 -90 0 | slow |
| vega | -30 -1497 10 -20 0 | slow |
| vega | -30 -1497 10 112 0 | fast |
But I can render the whole plat23 map (0 3568 3194 -90 66) with all tris visible, and keep a smooth 30fps:
But I can render the whole plat23 map (
0 3568 3194 -90 66) with all tris visible, and keep a smooth 30fps:
For comparing things like that it'd be helpful to know how many triangles are in each case in particular (some value of r_speeds shows that I believe).
Same computer, but with onboard nvidia GPU.
- GPU: Nvidia Geforce 6150 LE, NV4C/MCP61 (Curie, ), 128Mo shared DDR2, HyperTransport.
Where the comparison is not fair is that instead of dedicated memory the Nvidia chip uses shared DDR2 memory from the host and instead of PCIe 1.0 x16 bus the Nvidia uses HyperTransport (some onboard bus shared with CPU). This may help with ioctl when something on the GPU is read from/written to by the CPU.
The Nvidia has a larger ALU (but that should not make a difference as long as shader runs) and may accept larger images (or properly downscale them). The ATi one uses ARB_framebuffer_object but the Nvidia only provides the older EXT_framebuffer_object which is a subset (but that should make no difference because we only use the EXT_framebuffer_object part of ARB_framebuffer_object).
Like the ATi one is from the first generation of ATi cards to support OpenGL 2, that Nvidia card is from the first generation of Nvidia cards to support OpenGL 2, and is from the year after then ATi one. Both are low end, in fact the “LE“ string in the product name means “low end”.
That card can sustain solid 20fps on both scenes: (Thunder 1024 -3456 448 -90 0, Vega 1472 -1792 176 -180 0):
For comparison on the same computer and that old ATi card, the vega-beta1a map on Tremulous (some GrangerHub Tremulous 1.3 build) on OpenGL 1 renderer gets 130fps (almost 140fps) with lightmaps and 190fps with vertex lighting on same scene:
With the OpenGL 2 renderer it gets 40fps (almost 50fps):
Except for the vertex lighting, this is using the default Tremulous configuration (no “lowest”-like preset).
That other engine is also ridled with bugs, on such hardware the OpenGL1 renderer spawns lots of error logs every frame:
Mesa: error: GL_INVALID_VALUE in glLockArraysEXT(count)
Mesa: error: GL_INVALID_OPERATION in glUnlockArraysEXT(reexit)
and on such hardware the OpenGL2 renderer doesn't work at all with lightmaps, and some fonts are totally broken, and some texture blending is broken:
But the experiments says something: not only it may be possible to render such map on such hardware without any performance drop in that scene, but we may be able to achieve double the performance on such hardware. Edit: In fact we already achieve 60fps in other maps that don't trigger the slow down.
This test was done with the ATi card described in first message, on the computer described in first message.
That's not the same map build that is used in Tremulous, but they should not be that different. And the amount of tris should be similar:
So, it looks like I uncovered not only one performance problem but two.
First performance problem is that serious performance drop in some places.
Second performance problem is that we may be able to get twice the same performance on some lower-end hardware but don't get it.
I expect such hardware to perform better with the OpenGL1 renderer, as at the time most of the game were doing things the GL1 way and only using shaders for some effects, so the hardware is probably architectured for that in mind. On the contrary we use shaders for everything.
But since the OpenGL 2 renderer of ioquake3 seems to still be twice faster than us when doing things the shader way in that range of hardware, I believe we should also get twice the performance we already have (reaching 40fps on such map).
And I'm very annoyed by how much it is hard to track down this. I rebuilt entirely Orbit and its dependencies to run on that old CPU, I have a own build of Mesa and libdrm (even libpciaccess) built with debug symbols, I installed all the debug packages I could find for the installed packages of my distro, including the one of the libc, and Orbit still fails to track down the slow code backtrace.
It looks like I got a calltrace with GDB on drmIoctl:
(gdb) break drmIoctl
Breakpoint 1 at 0x716d6f6ad070: file ../xf86drm.c, line 700.
(gdb) continue
Continuing.
[Switching to Thread 0x716d6e01ba80 (LWP 492851)]
Thread 1 "daemon" hit Breakpoint 1, drmIoctl (fd=15, request=request@entry=3223348318, arg=arg@entry=0x7ffec10c4a50) at ../xf86drm.c:700
700 {
(gdb) backtrace
#0 drmIoctl (fd=15, request=request@entry=3223348318, arg=arg@entry=0x7ffec10c4a50) at ../xf86drm.c:700
#1 0x0000716d6f6b0260 in drmCommandWriteRead (fd=<optimized out>, drmCommandIndex=drmCommandIndex@entry=30, data=data@entry=0x7ffec10c4a50, size=size@entry=32) at ../xf86drm.c:3186
#2 0x0000716d4a82539c in radeon_bo_do_map (bo=0x5688c5e046a0) at ../src/gallium/winsys/radeon/drm/radeon_drm_bo.c:445
#3 0x0000716d4a7e837e in r300_buffer_transfer_map (context=0x5688c260d040, resource=0x5688c5dd0430, level=<optimized out>, usage=802, box=0x7ffec10c4b30, ptransfer=0x5688c254bf78) at ../src/gallium/drivers/r300/r300_screen_buffer.c:146
#4 0x0000716d4a6b1730 in pipe_buffer_map_range (transfer=0x5688c254bf78, access=<optimized out>, length=4235264, offset=0, buffer=<optimized out>, pipe=<optimized out>) at ../src/gallium/auxiliary/util/u_inlines.h:398
#5 u_upload_alloc_buffer (min_size=4232544, upload=0x5688c254bf50) at ../src/gallium/auxiliary/util/u_upload_mgr.c:240
#6 u_upload_alloc (upload=0x5688c254bf50, min_out_offset=<optimized out>, size=48, alignment=alignment@entry=4, out_offset=out_offset@entry=0x7ffec10c4c84, outbuf=outbuf@entry=0x7ffec10c4c88, ptr=0x7ffec10c4c90) at ../src/gallium/auxiliary/util/u_upload_mgr.c:273
#7 0x0000716d4a6b2ca3 in u_vbuf_translate_buffers (unroll_indices=false, min_index=<optimized out>, num_vertices=3, start_vertex=264531, out_vb=1, vb_mask=<optimized out>, draw=<optimized out>, info=<optimized out>, key=<optimized out>, mgr=0x5688c25e1130) at ../src/gallium/auxiliary/util/u_vbuf.c:587
#8 u_vbuf_translate_begin (mgr=mgr@entry=0x5688c25e1130, info=info@entry=0x7ffec10c5b40, draw=draw@entry=0x7ffec10c5b24, start_vertex=start_vertex@entry=264531, num_vertices=num_vertices@entry=3, min_index=<optimized out>, unroll_indices=<optimized out>, misaligned=<optimized out>) at ../src/gallium/auxiliary/util/u_vbuf.c:805
#9 0x0000716d4a6b5120 in u_vbuf_draw_vbo (pipe=0x5688c260d040, info=0x7ffec10c5cd0, drawid_offset=93, indirect=0x0, draws=<optimized out>, num_draws=243) at ../src/gallium/auxiliary/util/u_vbuf.c:1726
#10 0x0000716d4a2d9cb3 in _mesa_validated_multidrawelements (ctx=ctx@entry=0x5688c24fcef0, index_bo=<optimized out>, mode=mode@entry=4, count=count@entry=0x5688c1947120 <tess+8432>, type=type@entry=5125, indices=indices@entry=0x5688c19451e0 <tess+432>, primcount=<optimized out>, basevertex=0x0) at ../src/mesa/main/draw.c:2118
#11 0x0000716d4a2dbbc9 in _mesa_MultiDrawElements (mode=4, count=0x5688c1947120 <tess+8432>, type=5125, indices=0x5688c19451e0 <tess+432>, primcount=243) at ../src/mesa/main/draw.c:2171
#12 0x00005688c0c6b099 in Tess_DrawElements () at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:328
#13 0x00005688c0c6c320 in Render_generic3D (pStage=0x716d6c7148c0) at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:791
#14 0x00005688c0c6c38b in Render_generic (pStage=0x716d6c7148c0) at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:804
#15 0x00005688c0c71634 in Tess_StageIteratorColor () at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:2601
#16 0x00005688c0c71f29 in Tess_End () at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:2837
#17 0x00005688c0bc8387 in RB_RenderDrawSurfaces (fromSort=shaderSort_t::SS_DEPTH, toSort=shaderSort_t::SS_DEPTH, drawSurfFilter=DRAWSURFACES_ALL) at Unvanquished/daemon/src/engine/renderer/tr_backend.cpp:903
#18 0x00005688c0bd4f33 in RB_RenderView (depthPass=true) at Unvanquished/daemon/src/engine/renderer/tr_backend.cpp:4802
#19 0x00005688c0bd7e0c in DrawViewCommand::ExecuteSelf (this=0x716d4fa8fa74) at Unvanquished/daemon/src/engine/renderer/tr_backend.cpp:5710
#20 0x00005688c0bd85e6 in RB_ExecuteRenderCommands (data=0x716d4fa8f04c) at Unvanquished/daemon/src/engine/renderer/tr_backend.cpp:5961
#21 0x00005688c0bf9f72 in R_IssueRenderCommands (runPerformanceCounters=true) at Unvanquished/daemon/src/engine/renderer/tr_cmds.cpp:183
#22 0x00005688c0bfb07d in RE_EndFrame (frontEndMsec=0x0, backEndMsec=0x0) at Unvanquished/daemon/src/engine/renderer/tr_cmds.cpp:889
#23 0x00005688c0b35ce0 in SCR_UpdateScreen () at Unvanquished/daemon/src/engine/client/cl_scrn.cpp:327
#24 0x00005688c0b25dc6 in CL_Frame (msec=193) at Unvanquished/daemon/src/engine/client/cl_main.cpp:2087
#25 0x00005688c0a44870 in Com_Frame () at Unvanquished/daemon/src/engine/qcommon/common.cpp:958
#26 0x00005688c0a3d259 in Application::ClientApplication::Frame (this=0x5688c0f23f40 <Application::GetApp()::app>) at Unvanquished/daemon/src/engine/client/ClientApplication.cpp:96
#27 0x00005688c0d2de4f in Application::Frame () at Unvanquished/daemon/src/engine/framework/Application.cpp:87
#28 0x00005688c0d6f4d7 in main (argc=60, argv=0x7ffec10c62e8) at Unvanquished/daemon/src/engine/framework/System.cpp:784
The slow code seems to be (if I caught the right drmIoctl call):
https://github.com/DaemonEngine/Daemon/blob/f42c220fe165591f1325d26edd3160032c698865/src/engine/renderer/tr_shade.cpp#L328
glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts,
GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
I added this logger right before the glMultiDrawElements one:
Log::Warn( "=== tess.multiDrawPrimitives: %d", tess.multiDrawPrimitives );
Then on the default Vega spectator scene (1472 -1792 176 -180 0), it loops printing this:
…
Warn: === tess.multiDrawPrimitives: 243
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 8
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 6
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 40
Warn: === tess.multiDrawPrimitives: 10
Warn: === tess.multiDrawPrimitives: 10
Warn: === tess.multiDrawPrimitives: 21
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 12
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 14
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 1
[again]
On the 667 -1792 176 0 0 slow scene, I get this:
Warn: === tess.multiDrawPrimitives: 246
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 8
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 9
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 23
Warn: === tess.multiDrawPrimitives: 17
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 17
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 20
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 5
Warn: === tess.multiDrawPrimitives: 5
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 9
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
[again]
On the 667 -1792 176 -180 0 fast scene (same position in space but looking in opposite direction), it loops printing this:
Warn: === tess.multiDrawPrimitives: 62
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 16
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 3
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 4
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 7
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 2
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
Warn: === tess.multiDrawPrimitives: 1
[again]
Just as a quick test, reducing MAX_MULTIDRAW_PRIMITIVES doesn't help.
Also the ioq3 and grangerhub's tremulous source codes have no reference to glMultiDrawElements. 🤔️
At the same places we call Tess_DrawElements that calls glMultiDrawElements, they call R_DrawElements that calls glDrawElements.
Also for curiosity, I did that, but I still get 5fps on that default Vega scene:
if ( tess.multiDrawPrimitives )
{
- glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ if ( false )
+ {
+ glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ }
+ else
+ {
+ for ( i = 0; i < tess.multiDrawPrimitives; i++ )
+ {
+ glDrawElements( GL_TRIANGLES, tess.multiDrawCounts[ i ], GL_INDEX_TYPE, tess.multiDrawIndexes[ i ] );
+ }
+ }
And if I interrupt, I interrupt in the middle of an ioctl called by my own glDrawElements:
Thread 1 (Thread 0x74bd4686ca80 (LWP 519073) "daemon"):
#0 __GI___ioctl (fd=fd@entry=15, request=request@entry=3223348317) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1 0x000074bd47ef00a0 in drmIoctl (fd=15, request=request@entry=3223348317, arg=arg@entry=0x7fff5a18a700) at ../xf86drm.c:704
#2 0x000074bd47ef3260 in drmCommandWriteRead (fd=<optimized out>, drmCommandIndex=drmCommandIndex@entry=29, data=data@entry=0x7fff5a18a700, size=size@entry=32) at ../xf86drm.c:3186
#3 0x000074bd23076013 in radeon_create_bo (rws=rws@entry=0x5f49b6c500e0, size=size@entry=4370432, alignment=alignment@entry=4096, initial_domains=initial_domains@entry=2, flags=flags@entry=16, heap=heap@entry=0) at ../src/gallium/winsys/radeon/drm/radeon_drm_bo.c:632
#4 0x000074bd230767c5 in radeon_winsys_bo_create (rws=0x5f49b6c500e0, size=4370432, alignment=<optimized out>, domain=<optimized out>, flags=<optimized out>) at ../src/gallium/winsys/radeon/drm/radeon_drm_bo.c:1043
#5 0x000074bd2303a5e4 in r300_buffer_create (screen=0x5f49b6c51050, templ=0x7fff5a18a830) at ../src/gallium/drivers/r300/r300_screen_buffer.c:191
#6 0x000074bd22f036a3 in u_upload_alloc_buffer (min_size=4366992, upload=0x5f49b6cfe660) at ../src/gallium/auxiliary/util/u_upload_mgr.c:208
#7 u_upload_alloc (upload=0x5f49b6cfe660, min_out_offset=<optimized out>, size=2816, alignment=alignment@entry=4, out_offset=out_offset@entry=0x7fff5a18a974, outbuf=outbuf@entry=0x7fff5a18a978, ptr=0x7fff5a18a980) at ../src/gallium/auxiliary/util/u_upload_mgr.c:273
#8 0x000074bd22f04ca3 in u_vbuf_translate_buffers (unroll_indices=false, min_index=<optimized out>, num_vertices=176, start_vertex=272761, out_vb=1, vb_mask=<optimized out>, draw=<optimized out>, info=<optimized out>, key=<optimized out>, mgr=0x5f49b6d937a0) at ../src/gallium/auxiliary/util/u_vbuf.c:587
#9 u_vbuf_translate_begin (mgr=mgr@entry=0x5f49b6d937a0, info=info@entry=0x7fff5a18b830, draw=draw@entry=0x7fff5a18b814, start_vertex=start_vertex@entry=272761, num_vertices=num_vertices@entry=176, min_index=<optimized out>, unroll_indices=<optimized out>, misaligned=<optimized out>) at ../src/gallium/auxiliary/util/u_vbuf.c:805
#10 0x000074bd22f07120 in u_vbuf_draw_vbo (pipe=0x5f49b6cad050, info=0x7fff5a18b9e0, drawid_offset=0, indirect=0x0, draws=<optimized out>, num_draws=1) at ../src/gallium/auxiliary/util/u_vbuf.c:1726
#11 0x000074bd22b2c18d in _mesa_validated_drawrangeelements (ctx=ctx@entry=0x5f49b6caf170, index_bo=<optimized out>, mode=mode@entry=4, index_bounds_valid=index_bounds_valid@entry=false, start=start@entry=0, end=end@entry=4294967295, count=276, type=5125, indices=0x2353c8, basevertex=0, numInstances=1, baseInstance=0) at ../src/mesa/main/draw.c:1734
#12 0x000074bd22b2d689 in _mesa_DrawElementsBaseVertex (mode=4, count=276, type=5125, indices=0x2353c8, basevertex=0) at ../src/mesa/main/draw.c:1877
#13 0x00005f49b54c70c2 in Tess_DrawElements () at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:337
#14 0x00005f49b54c8344 in Render_generic3D (pStage=0x74bd44f148c0) at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:802
#15 0x00005f49b54c83af in Render_generic (pStage=0x74bd44f148c0) at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:815
#16 0x00005f49b54cd658 in Tess_StageIteratorColor () at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:2612
#17 0x00005f49b54cdf4d in Tess_End () at Unvanquished/daemon/src/engine/renderer/tr_shade.cpp:2848
#18 0x00005f49b5424387 in RB_RenderDrawSurfaces (fromSort=shaderSort_t::SS_DEPTH, toSort=shaderSort_t::SS_DEPTH, drawSurfFilter=DRAWSURFACES_ALL) at Unvanquished/daemon/src/engine/renderer/tr_backend.cpp:903
#19 0x00005f49b5430f33 in RB_RenderView (depthPass=true) at Unvanquished/daemon/src/engine/renderer/tr_backend.cpp:4802
#20 0x00005f49b5433e0c in DrawViewCommand::ExecuteSelf (this=0x74bd2828fa74) at Unvanquished/daemon/src/engine/renderer/tr_backend.cpp:5710
#21 0x00005f49b54345e6 in RB_ExecuteRenderCommands (data=0x74bd2828f04c) at Unvanquished/daemon/src/engine/renderer/tr_backend.cpp:5961
#22 0x00005f49b5455f72 in R_IssueRenderCommands (runPerformanceCounters=true) at Unvanquished/daemon/src/engine/renderer/tr_cmds.cpp:183
#23 0x00005f49b545707d in RE_EndFrame (frontEndMsec=0x0, backEndMsec=0x0) at Unvanquished/daemon/src/engine/renderer/tr_cmds.cpp:889
#24 0x00005f49b5391ce0 in SCR_UpdateScreen () at Unvanquished/daemon/src/engine/client/cl_scrn.cpp:327
#25 0x00005f49b5381dc6 in CL_Frame (msec=188) at Unvanquished/daemon/src/engine/client/cl_main.cpp:2087
#26 0x00005f49b52a0870 in Com_Frame () at Unvanquished/daemon/src/engine/qcommon/common.cpp:958
#27 0x00005f49b5299259 in Application::ClientApplication::Frame (this=0x5f49b577ff40 <Application::GetApp()::app>) at Unvanquished/daemon/src/engine/client/ClientApplication.cpp:96
#28 0x00005f49b5589e73 in Application::Frame () at Unvanquished/daemon/src/engine/framework/Application.cpp:87
#29 0x00005f49b55cb4fb in main (argc=60, argv=0x7fff5a18c008) at Unvanquished/daemon/src/engine/framework/System.cpp:784
It looks to happen right after the autosprite deform in RB_RenderDrawSurfaces (skipping that autosprite deform doesn't help):
if ( oldShader != nullptr )
{
if ( oldShader->autoSpriteMode && !(tess.attribsSet & ATTR_ORIENTATION) ) {
Tess_AutospriteDeform( oldShader->autoSpriteMode,
0, tess.numVertexes,
0, tess.numIndexes );
}
Tess_End(); // <-- here
}
So I did that:
diff --git a/src/engine/renderer/tr_shade.cpp b/src/engine/renderer/tr_shade.cpp
index c99eb8369..3063a5117 100644
--- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -311,6 +311,8 @@ void GLSL_FinishGPUShaders()
Tess_DrawElements
==================
*/
+#include <ctime>
+#include <cstdio>
void Tess_DrawElements()
{
int i;
@@ -325,7 +327,10 @@ void Tess_DrawElements()
{
if ( tess.multiDrawPrimitives )
{
+ clock_t start = clock();
glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, primitives: %d\n", diff, tess.multiDrawPrimitives);
backEnd.pc.c_multiDrawElements++;
backEnd.pc.c_multiDrawPrimitives += tess.multiDrawPrimitives;
I get this:
=== time: 108147, primitives: 243
=== time: 1714, primitives: 1
=== time: 102, primitives: 2
=== time: 662, primitives: 8
=== time: 79, primitives: 3
=== time: 39, primitives: 6
=== time: 70, primitives: 2
=== time: 767, primitives: 40
=== time: 5316, primitives: 10
=== time: 11874, primitives: 10
=== time: 10550, primitives: 21
=== time: 7498, primitives: 4
=== time: 1557, primitives: 1
=== time: 1463, primitives: 1
=== time: 9861, primitives: 12
=== time: 177, primitives: 7
=== time: 43, primitives: 1
=== time: 41, primitives: 3
=== time: 112, primitives: 4
=== time: 39, primitives: 1
=== time: 176, primitives: 14
=== time: 96, primitives: 3
=== time: 122, primitives: 1
=== time: 2139, primitives: 2
=== time: 3501, primitives: 4
=== time: 1959, primitives: 7
=== time: 1757, primitives: 2
=== time: 3340, primitives: 2
=== time: 1979, primitives: 2
=== time: 42, primitives: 1
=== time: 26, primitives: 1
=== time: 39, primitives: 1
=== time: 21, primitives: 1
=== time: 2012, primitives: 1
=== time: 50, primitives: 1
=== time: 1914, primitives: 1
=== time: 69, primitives: 1
=== time: 39, primitives: 1
=== time: 30, primitives: 1
=== time: 1677, primitives: 3
=== time: 94, primitives: 1
=== time: 36, primitives: 7
=== time: 3622, primitives: 7
=== time: 100, primitives: 7
=== time: 60, primitives: 1
With:
diff --git a/src/engine/renderer/tr_shade.cpp b/src/engine/renderer/tr_shade.cpp
index c99eb8369..01aac402c 100644
--- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -311,6 +311,8 @@ void GLSL_FinishGPUShaders()
Tess_DrawElements
==================
*/
+#include <ctime>
+#include <cstdio>
void Tess_DrawElements()
{
int i;
@@ -325,7 +327,10 @@ void Tess_DrawElements()
{
if ( tess.multiDrawPrimitives )
{
+ clock_t start = clock();
glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, primitives: %d, time/primitives: %f\n", diff, tess.multiDrawPrimitives, (float) diff/tess.multiDrawPrimitives);
backEnd.pc.c_multiDrawElements++;
backEnd.pc.c_multiDrawPrimitives += tess.multiDrawPrimitives;
I get:
=== time: 108013, primitives: 243, time/primitives: 444.497955
=== time: 1716, primitives: 1, time/primitives: 1716.000000
=== time: 81, primitives: 2, time/primitives: 40.500000
=== time: 667, primitives: 8, time/primitives: 83.375000
=== time: 46, primitives: 3, time/primitives: 15.333333
=== time: 39, primitives: 6, time/primitives: 6.500000
=== time: 68, primitives: 2, time/primitives: 34.000000
=== time: 730, primitives: 40, time/primitives: 18.250000
=== time: 5799, primitives: 10, time/primitives: 579.900024
=== time: 10355, primitives: 10, time/primitives: 1035.500000
=== time: 12884, primitives: 21, time/primitives: 613.523804
=== time: 4710, primitives: 4, time/primitives: 1177.500000
=== time: 1542, primitives: 1, time/primitives: 1542.000000
=== time: 1443, primitives: 1, time/primitives: 1443.000000
=== time: 11180, primitives: 12, time/primitives: 931.666687
=== time: 187, primitives: 7, time/primitives: 26.714285
=== time: 34, primitives: 1, time/primitives: 34.000000
=== time: 37, primitives: 3, time/primitives: 12.333333
=== time: 69, primitives: 4, time/primitives: 17.250000
=== time: 33, primitives: 1, time/primitives: 33.000000
=== time: 141, primitives: 14, time/primitives: 10.071428
=== time: 79, primitives: 3, time/primitives: 26.333334
=== time: 109, primitives: 1, time/primitives: 109.000000
=== time: 1615, primitives: 2, time/primitives: 807.500000
=== time: 1671, primitives: 4, time/primitives: 417.750000
=== time: 2049, primitives: 7, time/primitives: 292.714294
=== time: 1658, primitives: 2, time/primitives: 829.000000
=== time: 3267, primitives: 2, time/primitives: 1633.500000
=== time: 3889, primitives: 2, time/primitives: 1944.500000
=== time: 34, primitives: 1, time/primitives: 34.000000
=== time: 12, primitives: 1, time/primitives: 12.000000
=== time: 44, primitives: 1, time/primitives: 44.000000
=== time: 17, primitives: 1, time/primitives: 17.000000
=== time: 2041, primitives: 1, time/primitives: 2041.000000
=== time: 44, primitives: 1, time/primitives: 44.000000
=== time: 1917, primitives: 1, time/primitives: 1917.000000
=== time: 64, primitives: 1, time/primitives: 64.000000
=== time: 74, primitives: 1, time/primitives: 74.000000
=== time: 30, primitives: 1, time/primitives: 30.000000
=== time: 1322, primitives: 3, time/primitives: 440.666656
=== time: 67, primitives: 1, time/primitives: 67.000000
=== time: 28, primitives: 7, time/primitives: 4.000000
=== time: 1740, primitives: 7, time/primitives: 248.571426
=== time: 91, primitives: 7, time/primitives: 13.000000
=== time: 48, primitives: 1, time/primitives: 48.000000
Some of them are really slow to render, 1 or 2ms for a single thing…
If I take a simple scene that achieves 47fps on the same Vega map (358 -1792 170 -180 0) , I get this:
=== time: 545, primitives: 28, time/primitives: 19.464285
=== time: 232, primitives: 9, time/primitives: 25.777779
=== time: 50, primitives: 1, time/primitives: 50.000000
=== time: 33, primitives: 2, time/primitives: 16.500000
=== time: 23, primitives: 2, time/primitives: 11.500000
=== time: 27, primitives: 2, time/primitives: 13.500000
=== time: 1516, primitives: 1, time/primitives: 1516.000000
=== time: 1322, primitives: 5, time/primitives: 264.399994
=== time: 1683, primitives: 2, time/primitives: 841.500000
=== time: 1925, primitives: 1, time/primitives: 1925.000000
=== time: 26, primitives: 1, time/primitives: 26.000000
=== time: 50, primitives: 1, time/primitives: 50.000000
=== time: 18, primitives: 1, time/primitives: 18.000000
=== time: 1613, primitives: 1, time/primitives: 1613.000000
=== time: 53, primitives: 1, time/primitives: 53.000000
=== time: 43, primitives: 1, time/primitives: 43.000000
There is much less things to do, but still things lasting more than 1ms and even close to 2ms.
So it looks like we have a very slow stuff somewhere, and having more tris just gives more chance to get those slow stuff to render. If we find them and fasten them, we may not only fix the huge performance drop but also the general performance.
Now I need help to identify what are those things so slow to render…
So I did this:
diff --git a/src/engine/renderer/tr_shade.cpp b/src/engine/renderer/tr_shade.cpp
index c99eb8369..8b4a46f45 100644
--- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -793,6 +793,8 @@ void Render_generic3D( shaderStage_t *pStage )
GL_CheckErrors();
}
+#include <ctime>
+#include <cstdio>
void Render_generic( shaderStage_t *pStage )
{
if ( backEnd.projection2D )
@@ -801,7 +803,10 @@ void Render_generic( shaderStage_t *pStage )
return;
}
+ clock_t start = clock();
Render_generic3D( pStage );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, %s\n", diff, tess.surfaceShader->name);
}
/*
And I got this:
=== time: 106604, <default>$depth
=== time: 1666, models/mapobjects/vega/globe/globe$depth
=== time: 123, models/mapobjects/vega/globe/globe
=== time: 58, models/mapobjects/vega/monitor/display_door_red
=== time: 14, models/mapobjects/vega/monitor/display_door_blue
=== time: 2043, textures/vega_custom/starchart02
=== time: 33, textures/vega_custom/wheeltracks
=== time: 86, textures/vega_custom/holo_projector
=== time: 34, textures/shared_vega/glass01
=== time: 1764, textures/shared_vega/glass01
I don't know why the $depth things are so slow, at least we notice that the globe has one, but the textures/vega_custom/starchart02 looks totally stupidly slow, it's the simplest material we can have:
textures/vega_custom/starchart02
{
qer_editorImage textures/vega_custom_src/starchart02_d
q3map_surfacelight 150
surfaceparm nolightmap
{
map textures/vega_custom_src/starchart02_d
}
}
Spending 2ms on rendering this is very wrong.
And for the glass01 that also looks stupidly slow, it's a bit more complex but not crazily complex:
textures/shared_vega/glass01
{
qer_editorImage textures/shared_vega_src/glass01_b
qer_trans .7
cull none
{
map textures/shared_vega_src/glass01_env
blendFunc gl_dst_color gl_one
tcGen environment
}
{
map textures/shared_vega_src/glass01_b
blend filter
}
{
map $lightmap
blendFunc gl_dst_color gl_one
}
}
Also, why one glass01 stage is fast, and one is slow?
And since the lightmap is rendered by Render_lightmap and not Render_generic3D, it looks like the slow glass01 stage is the second one, so this one:
{
map textures/shared_vega_src/glass01_b
blend filter
}
It doesn't make sense at all that this stage can take almost 1.7ms to render!
The simple fact everytime I interrupt I stop on a drmIoctl called by Render_generic3D looks stupid, this Render_generic3D function is for the simplest surfaces we have! Something looks to be badly wrong somewhere.
Knowing the starchart02 shader could also be rendered by the Render_lightMapping (it's just supposedly more complex than Render_generic3D), I did that:
void Render_generic( shaderStage_t *pStage )
{
if ( backEnd.projection2D )
{
Render_generic2D( pStage );
return;
}
+ if ( tess.surfaceShader->sort != Util::ordinal( shaderSort_t::SS_DEPTH ) )
+ {
+ Render_lightMapping( pStage );
+ }
clock_t start = clock();
Render_generic3D( pStage );
clock_t diff = clock() - start;
printf("=== time: %ld, %s\n", diff, tess.surfaceShader->name);
}
Of course it broke some other textures that cannot be rendered by Render_lightMapping, but I made sure it keeps the depth stage being rendered the usual way so outside of starchart02 I basically get some other textures being wrong but that's all.
And for starchart02, I got his time:
=== time: 72, textures/vega_custom/starchart02
And that texture looks to be prolerly rendered! The time to render it went from 2ms to 72ns! Something is definitely wrong in Render_generic3D, I may have found something big!
Is there some reason to believe that Render_generic3D is wrong, as opposed to the driver being crap?