OpenImageIO Fix fast_exp simd test when building with aggressive optimization flags

Description

After building OIIO on my machine and running the test suite, I noticed that I had some failures. All failures besides the simd test was fixed by adding -ffp-contract=off to my compile flags. (IIRC, GCC adds -ffp-contract=on when compiling with more aggressive flags while Clang and the intel compilers seems to not add this)

As only the single fast_exp test is failing, I'm guessing that adding a threshold to this test is fine? Or is this expected to have no deviation at all no matter what?

I've tested this on three different x86_64 Linux computers (AMD/Intel CPUs). All fail in the same way and this fix seem to remedy the fast_exp failure.

My compile flags are -march=native -O2 -pipe -ffp-contract=off. The simd test passes without this patch if I remove -march=native.

Checklist:

[x] I have read the contribution guidelines.

I haven't signed the CLA yet, but I can if the change is deemed applicable. I also don't mind someone committing this in their name as the change is so minimal.

Nov 27 '23 14:11 DarkDefender

:x: - login: @DarkDefender . The commit (5096eb4150b37590ccb04067449e874b52554b05) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.

Nov 27 '23 14:11 linux-foundation-easycla[bot]

There's a related discussion on this in https://github.com/AcademySoftwareFoundation/OpenImageIO/issues/3890#issuecomment-1606538525.

In that one lg said:

The solution in this case is simple: just bump the tolerance a teeny bit. The trick is that without an ARM machine to test on, it's hard to know how much. (Another possible route is to really dissect these functions and figure out exactly where and why the scalar and simd implementations differ and decide if we should fix them. But considering it only affects these particular functions that are specifically advertised as being approximate, I'm not sure how much trouble it is worth to ensure it's bit identical.)

So it looks like you've done the work of at least testing the tolerance needed for x86_64? Is this the smallest tolerance you could get away with, say within a factor of 2?

Nov 27 '23 17:11 ThiagoIze

In this case, I just copied the epsilon value from the log test above it. I can try to do a thorough examination of the difference in the output values though if needed. However that tolerance might change on other architectures, so I don't know how low we would want to go to still have some leeway.

Nov 27 '23 21:11 DarkDefender

All architectures I believe will give the same result provided the compiler generated the same math instructions, which it should if all the archs have the same math instructions enabled and you're not enabling destructive optimizations like -ffast-math. In other words, I suspect what you find on x86_64 will be identical to ARM when both fully enable or disable FMA.

But thinking about this some more, I suspect the issue here might be that the SIMD version is doing the FMA while the scalar, as you requested, is not (or maybe it's the other way around?). So maybe what you want is to force OIIO to do the same thing for both scalar and SIMD. So in this case disable FMA everywhere (hack OIIO to not define OIIO_FMA_ENABLED if you set -ffp-contract=off) or we modify the madd(float) variant to explicitly do a scalar FMA operation (call _mm_fmadd_ss) instead of relying on pragmas which get disabled when you set -ffp-contract=off.

Disabling FMA in OIIO will give you consistent results and let the testsuite pass and you'll get the same results (at least with respect to FMA changes) between scalar and SIMD, but it makes all these SIMD functions a tiny bit slower, which is a shame, though possibly not measurable.

The _mm_fmadd_ss option would I think keep us using FMA for scalar when we explicitly request it but prevent the compiler from sneaking it in and breaking the testsuite. If this works, that's the option I would go with (along with the neon equivalent).

I'm not on an x86_64 machine at the moment so can't currently try these out myself.

Also, you mentioned disabling contraction was needed for fixing other testsuite fails? Has that been reported? It sounds like we should also fix those. Maybe in the process of fixing the rest of these issues it'll solve this issue as well.

Nov 28 '23 01:11 ThiagoIze

Just to be clear on when and how the unit_simd test fails. It fails for me with march=native regardless if -ffp-contract=off is set or not. These is the values that the SIMD and non SIMD code returns (same result with or without -ffp-contract=off):

 0.3678778707981110 1.0000000000000000 2.7182760238647461 90.0169906616210938
 0.3678778409957886 1.0000000000000000 2.7182760238647461 90.0169906616210938

It seems like it is only the first value that has any noticeable deviation with these many decimal points.

I tested out replacing the madd with _mm_fmadd_ss and it does make the unit_simd test pass when compiling with march=native for me. However if we go this route then we need to check for if the compiler has turned on SSE optimizations as compiling with just -O2 -pipe will fail.

In addition to that, does the test actually make any sense if we change the regular code to use SIMD instructions? Because then we are actually just checking if the code is using the exact same SIMD instructions.

To me it seems like GCC replaces the regular madd call with something that is not the exact same SIMD instruction OIIO is using when using more aggressive optimization levels. Which I think is fine if the result doesn't deviate enough from the expected one.

Also, you mentioned disabling contraction was needed for fixing other testsuite fails? Has that been reported? It sounds like we should also fix those. Maybe in the process of fixing the rest of these issues it'll solve this issue as well.

I have not reported those at the moment. These are the tests that were failing for me with -march=native:

texture-crop
texture-interp-bilinear
texture-interp-closest
texture-mip-stochasticaniso
texture-uint8
texture-skinny
texture-crop.batch
texture-half.batch
texture-uint16.batch
texture-interp-bilinear.batch
texture-interp-closest.batch
texture-levels-stochaniso.batch
texture-levels-stochmip.batch
texture-mip-onelevel.batch
texture-mip-stochastictrilinear.batch
texture-mip-stochasticaniso.batch
texture-uint8.batch
texture-skinny.batch
tiff-depths
unit_simd

Visually they seem fine, but the reported deviation seems to be a bit outside of what the tests have been setup to accept.

Nov 29 '23 16:11 DarkDefender

I tested out replacing the madd with _mm_fmadd_ss and it does make the unit_simd test pass when compiling with march=native for me. However if we go this route then we need to check for if the compiler has turned on SSE optimizations as compiling with just -O2 -pipe will fail.

Glad to hear that works. I believe you can create two code paths controlled by the appropriate ifdefs. So when that's true, you do the manual fmadd and when it's false you do the regular scalar code we currently have. I started working on that and got it mostly working, but the code it generated on msvc and gcc was not optimal. I'm sure one day those compilers will optimize the intrinsics properly but I realized there's a much better solution.

math.h has the fmaf() function which does what we want. this means we can replace all the code with a call to fmaf() when we know there's hardware support and we get exactly what we want. When there's not hw support, we do not want to call this because it introduces overhead (there's a function call AND it's doing extra computation in order to get the same precision as what the hw provides), so we just do regular scalar code.

I tested this out in https://gcc.godbolt.org/z/bn4G6zzxh, which shows clang, gcc, and msvc doing the appropriate thing with arm/x86_64 code both when the hw supports it and when it doesn't.

What do you think? Does this solve your issue?

Dec 04 '23 02:12 ThiagoIze

Thanks for the suggestion! :) I tried it and it does work for me with march=native -O2 -pipe. However if I switch back to simply -O2 -pipe, then I don't get any compile errors like I did with _mm_fmadd_ss however the test fails.

I'm a bit unsure what is happening as even if the defined(__AVX2__) for some reason is true, then it should still be the same result as the simd code path it is testing against, right?

Dec 08 '23 16:12 DarkDefender

I don't know which code path your compiler is taking or exactly how you have it coded up. Can you place some #pragma message ( "this version is called..." ) messages in the various code paths for the simd and scalar versions of these functions and then report back what simd and scalar code is being compiled? And what ifdefs are you using? I assume that if the test fails that one of these is running the fma version and the other is not.

Dec 09 '23 01:12 ThiagoIze

@DarkDefender , std::fmaf(...) inside madd solves the issue with gcc -ffp-contract=off (also you can remove OIIO_CLANG_PRAGMA(clang fp contract(fast)) there; it is not needed with std::fmaf).

I don't see any issues with pure -O2 (no -march, pure x86-64) with gcc 13.2.1 and clang 17.0.6. Maybe you have some extra system-wide flags from /etc/clang/gentoo-hardened.cfg or whichever profile you use.

Jan 20 '24 11:01 AngryLoki

This looks like the right fix. I can't merge because there's no CLA sign-off, but independently I ended up with essentially the same fix as part of #4187, so this fix will get incorporated through that merge.

Mar 16 '24 15:03 lgritz