qmath: micro-optimize BoxOnPlaneSide function
Extracted from:
- https://github.com/DaemonEngine/Daemon/pull/1125
This function is heavily used in CPU culling, so squeezing the maximum performance out of it is welcome.
Patches are meant to be squashed.
About this:
bool bit0 = dist[ 0 ] >= p->dist;
bool bit1 = dist[ 1 ] < p->dist;
return bit0 | ( bit1 << 1 );
We can also rewrite the last line this way:
return bit0 + ( bit1 * 2 );
The disassembly is the same in both cases:
// return bit0 | ( bit1 << 1 );
lea eax, [rax + rcx*2]
ret
// return bit0 + ( bit1 * 2 );
lea eax, [rax + rcx*2]
ret
MSVC prefers:
return bit0 + ( bit1 * 2 );
Otherwise it aborts compilation and prints this:
src\engine\qcommon\q_math.cpp(781,23):
error C2220: the following warning is treated as an error
[build\engine-lib.vcxproj]
src\engine\qcommon\q_math.cpp(781,23):
warning C4805: '|': unsafe mix of type 'bool' and type 'int' in operation
[build\engine-lib.vcxproj]
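For the record, the bitwise and arithmetic forms agree for all four bool combinations; a quick standalone check (the maskOr/maskAdd names are mine, just for illustration):

```cpp
// The bitwise form (warned about by MSVC's C4805) and the arithmetic
// form produce the same 2-bit mask for every bool combination.
static int maskOr( bool bit0, bool bit1 )  { return int( bit0 ) | ( int( bit1 ) << 1 ); }
static int maskAdd( bool bit0, bool bit1 ) { return bit0 + ( bit1 * 2 ); }
```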
Ah right, dist[ 1 ] < p->dist can be false if p->dist is 0.
What I understand is: if p->signbits >= 8, then dist[ 0 ] = dist[ 1 ] = 0 (as the old comment said). In that case dist[ 0 ] >= p->dist is false assuming p->dist is always positive, so the first bit is 0, and the only remaining test is dist[ 1 ] < p->dist for the second bit.
It looks like I misread the second comparison, I should have done:
if ( p->signbits >= 8 )
{
return 2 * ( 0 < p->dist );
}
Let's try with ASSERT_EQ( p->signbits & 7, 7 ).
I did ASSERT( p->signbits < 8 ) instead. I loaded some maps and moved around, and didn't trigger the assertion.
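To summarize the logic, here is a hedged scalar sketch of the optimized function (types simplified; the Plane struct and boxOnPlaneSideScalar name are stand-ins for the engine's cplane_t and BoxOnPlaneSide, for illustration only):

```cpp
#include <algorithm>

// Scalar stand-in for the optimized BoxOnPlaneSide logic. dist0 and
// dist1 accumulate, per axis, the maximum and minimum dot products of
// p.normal with the 8 box corners.
struct Plane { float normal[ 3 ]; float dist; };

static int boxOnPlaneSideScalar( const float mins[ 3 ], const float maxs[ 3 ], const Plane& p )
{
	float dist0 = 0.0f, dist1 = 0.0f;

	for ( int i = 0; i < 3; i++ )
	{
		float a = p.normal[ i ] * mins[ i ];
		float b = p.normal[ i ] * maxs[ i ];
		dist0 += std::max( a, b ); // toward the maximum corner dot product
		dist1 += std::min( a, b ); // toward the minimum corner dot product
	}

	// Bit 0: the box reaches the front side; bit 1: it reaches the back side.
	return ( dist0 > p.dist ) + 2 * ( dist1 < p.dist );
}
```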
I added a commit to encourage the compiler to vectorize the &1 operation. This produced fewer instructions (on the right):
No, that last commit breaks the culling. I was storing to a bool…
Now it works, without breaking the culling:
Just for fun I tried making an SSE version. I don't have much of an idea when this stuff is faster though, so I wouldn't suggest using it unless you really have a good way of benchmarking.
int BoxOnPlaneSide(const vec3_t emins, const vec3_t emaxs, const cplane_t* p)
{
ASSERT_LT(emins[0], emaxs[0]);
ASSERT_LT(emins[1], emaxs[1]);
ASSERT_LT(emins[2], emaxs[2]);
auto mins = sseLoadVec3Unsafe(emins);
auto maxs = sseLoadVec3Unsafe(emaxs);
auto normal = sseLoadVec3Unsafe(p->normal);
auto prod0 = _mm_mul_ps(maxs, normal);
auto prod1 = _mm_mul_ps(mins, normal);
auto pmax = _mm_max_ps(prod0, prod1);
auto pmin = _mm_min_ps(prod0, prod1);
ALIGNED(16, vec4_t pmaxv);
ALIGNED(16, vec4_t pminv);
_mm_store_ps(pmaxv, pmax);
_mm_store_ps(pminv, pmin);
float dist0 = pmaxv[0] + pmaxv[1] + pmaxv[2];
float dist1 = pminv[0] + pminv[1] + pminv[2];
return (dist0 > p->dist) + 2 * (dist1 < p->dist);
}
I added another change that helps the compiler to vectorize the multiplication.
Actually those lines can be completely removed by rewriting the BoxOnPlaneSide function to directly take a vec3_t[2] as input:
vec3_t bounds[ 2 ];
VectorCopy( emins, bounds[ 0 ] );
VectorCopy( emaxs, bounds[ 1 ] );
But this would require a more intrusive change modifying many files (even the game code relies on this function).
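A sketch of what that signature could look like (hypothetical, with simplified types; the real change would touch many callers, and the Plane struct here stands in for cplane_t):

```cpp
#include <algorithm>

typedef float vec3[ 3 ];

struct Plane { float normal[ 3 ]; float dist; };

// Hypothetical variant taking the bounds array directly:
// bounds[ 0 ] is mins, bounds[ 1 ] is maxs, so callers that already
// store a vec3[ 2 ] pass it with no copy.
static int boxOnPlaneSideBounds( const vec3 bounds[ 2 ], const Plane& p )
{
	float dist0 = 0.0f, dist1 = 0.0f;

	for ( int i = 0; i < 3; i++ )
	{
		float a = p.normal[ i ] * bounds[ 0 ][ i ];
		float b = p.normal[ i ] * bounds[ 1 ][ i ];
		dist0 += std::max( a, b );
		dist1 += std::min( a, b );
	}

	return ( dist0 > p.dist ) + 2 * ( dist1 < p.dist );
}
```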
Nevertheless, the compiler still produces fewer instructions. In fact, the copy may be avoided entirely because most engine calls to this function use emins and emaxs pairs that are already bounds[ 0 ] and bounds[ 1 ] in the calling code.
On the left: before the multiplication vectorization; in the center: after the multiplication vectorization; on the right: after the function has been modified to avoid a copy (patch not included in this branch):
We can see the same number of instructions in the center and on the right, but the instructions are a bit different. The engine could really benefit from using a bounds_t type containing both mins and maxs everywhere, but that's out of scope for this branch.
My duplication of p->normal was an attempt to see if the compiler could do a single 8-wide vector multiplication instead of two 4-wide vector multiplications; it looks like even with a native build on my modern system it does not. Worse than that, it emits more instructions when doing two 4-wide vector multiplications on the same data (it could simply drop the copy if it is useless). So I deduplicated, and it seems to do no worse.
On the left: before vector multiplication; on the right: after vector multiplication, without normal duplication. Note that the lines are shifted up by one because the compiler even saved one argument compared to prior to the vectorization:
I believe I cannot do more without going out of the scope of this function.
Ah great! Once everything optimizing R_RecursiveWorldNode is merged it will be fun to compare.
A slightly higher FPS, and less CPU usage spent in both the R_RecursiveWorldNode and BoxOnPlaneSide functions (relative to the rest):
I reproduce it on multiple runs.
I also tested the early returns for the simple cases before the SSE code and it was not faster than just replacing the whole body.
I included the SSE code with a #ifdef idx86_sse.
The non-SSE code will benefit other architectures (nacl, arm64).
I also verified the code behaved the same before this branch, with my optimizations, and with slipher's SSE code.
What I do for this: I go inside the ATCSHD central room, set r_lockPVS 1, then navigate the map and look at which surfaces are rendered and which aren't. The exact same surfaces are rendered and not rendered with the 3 code variants.
The generated code comparison between my optimized code and slipher's SSE code 🤪️:
Another thing that happens with that SSE code is that it is now small enough to be inlined when enabling LTO:
The non-SSE code doesn't get inlined when enabling LTO.
Cool, the compiler did some further vectorization of the "horizontal add" and got rid of the _mm_store_ps. (Horizontal add is jargon for adding up the elements of a single SSE vector.) I had looked at some horizontal add code here but wasn't sure that it's worthwhile for summing only 3 (rather than 4) elements.
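For illustration, one shuffle-based way to sum the first three lanes of an SSE vector without a round-trip through memory (a sketch of the idea, not the compiler's actual output; the horizontalAdd3 name is mine):

```cpp
#include <xmmintrin.h>

// "Horizontal add" of lanes 0..2 of an SSE vector: broadcast lanes 1
// and 2 into lane 0 with shuffles, then add the three scalars.
static float horizontalAdd3( __m128 v )
{
	__m128 lane1 = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 1, 1, 1, 1 ) ); // v[ 1 ] in lane 0
	__m128 lane2 = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 2, 2, 2, 2 ) ); // v[ 2 ] in lane 0
	return _mm_cvtss_f32( _mm_add_ss( _mm_add_ss( v, lane1 ), lane2 ) ); // v[ 0 ] + v[ 1 ] + v[ 2 ]
}
```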
I propose a documentation update since the function can now return 0 in the unlikely case of a zero-size box. I looked at the callsites and they seem OK as-is - a 0 return won't harm them.
/*
* ==================
* BoxOnPlaneSide
*
* Function used to rule out nonzero-volume intersection of an AABB and a half-space.
* Returns a 2-bit mask.
*
* If BIT( 0 ) is clear, the box definitely has zero volume on the "front" side of the
* plane (i.e. the side further in the direction of the normal vector).
* If BIT( 1 ) is clear, the box definitely has zero volume on the back side of the plane.
* ==================
*/
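The zero-return case can be seen with a degenerate box (mins == maxs) lying exactly on the plane: with the final formula the result is 0, meaning neither side has nonzero volume. A minimal demonstration with simplified stand-in types:

```cpp
// A zero-size box collapses to a single point, so the maximum and
// minimum corner dot products are both dot( normal, point ).
struct Plane { float normal[ 3 ]; float dist; };

static int sideOfPoint( const float point[ 3 ], const Plane& p )
{
	float d = p.normal[ 0 ] * point[ 0 ] + p.normal[ 1 ] * point[ 1 ] + p.normal[ 2 ] * point[ 2 ];
	return ( d > p.dist ) + 2 * ( d < p.dist );
}
```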
Cleaned-up SSE version with types and some comments:
int BoxOnPlaneSide(const vec3_t emins, const vec3_t emaxs, const cplane_t* p)
{
__m128 mins = sseLoadVec3Unsafe( emins );
__m128 maxs = sseLoadVec3Unsafe( emaxs );
__m128 normal = sseLoadVec3Unsafe( p->normal );
// Element-wise products
__m128 prod0 = _mm_mul_ps( maxs, normal );
__m128 prod1 = _mm_mul_ps( mins, normal );
// Element-wise min/max
__m128 pmax = _mm_max_ps( prod0, prod1 );
__m128 pmin = _mm_min_ps( prod0, prod1 );
ALIGNED( 16, vec4_t pmaxv );
ALIGNED( 16, vec4_t pminv );
_mm_store_ps( pmaxv, pmax );
_mm_store_ps( pminv, pmin );
// Maximum dot product with p->normal over the 8 box corners
float dist0 = pmaxv[ 0 ] + pmaxv[ 1 ] + pmaxv[ 2 ];
// Minimum dot product with p->normal over the 8 box corners
float dist1 = pminv[ 0 ] + pminv[ 1 ] + pminv[ 2 ];
return ( dist0 > p->dist ) + 2 * ( dist1 < p->dist );
}
I rebased and squashed the step-by-step commits.
I have written some useful comments in q_math: micro-optimize the BoxOnPlaneSide function, documenting the two functional differences compared to the original code:
- the removal of the p->signbits >= 8 condition,
- the rewrite of dist[0] >= p->dist into dist[0] > p->dist.
This looks ready to me.
LGTM