AVX 256 functions accepting `__m{128,256}i_u` should accept `__m{128,256}i`
Version of emscripten/emsdk:
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 4.0.6 (1ddaae4d2d6dfbb678ecc193bc988820d1fc4633)
clang version 21.0.0git (https:/github.com/llvm/llvm-project 4775e6d9099467df9363e1a3cd5950cc3d2fde05)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /data/home/sliphua/open/emsdk/upstream/bin
Failing command line in full: It fails to compile astc-encoder x86-64 AVX2 version. Below is the error and a working solution. The other steps to adapt to emscripten AVX seem irrelevant, so they are not written here.
The error:
$ cd ~/open/astc-encoder/build_avx2
$ emcmake cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ -DASTCENC_ISA_AVX2=ON ..
$ emmake make
make: make
[ 2%] Building CXX object Source/CMakeFiles/astcenc-avx2-static.dir/astcenc_averages_and_directions.cpp.o
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:137:7: error: no matching function for call to '_mm256_loadu_si256'
137 | m = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
| ^~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1276:1: note: candidate function not viable: no known conversion from 'const __m256i *' to 'const __m256i_u *' for 1st argument
1276 | _mm256_loadu_si256(__m256i_u const* __p) {
| ^ ~~~~~~~~~~~~~~~~~~~~
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:189:16: error: no matching function for call to '_mm256_lddqu_si256'
189 | return vint8(_mm256_lddqu_si256(reinterpret_cast<const __m256i*>(p)));
| ^~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1284:1: note: candidate function not viable: no known conversion from 'const __m256i *' to 'const __m256i_u *' for 1st argument
1284 | _mm256_lddqu_si256(__m256i_u const* __p) {
| ^ ~~~~~~~~~~~~~~~~~~~~
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:522:2: error: no matching function for call to '_mm256_storeu_si256'
522 | _mm256_storeu_si256(reinterpret_cast<__m256i*>(p), a.m);
| ^~~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1322:1: note: candidate function not viable: no known conversion from '__m256i *' to '__m256i_u *' for 1st argument
1322 | _mm256_storeu_si256(__m256i_u* __p, __m256i __a) {
| ^ ~~~~~~~~~~~~~~
3 errors generated.
make[2]: *** [Source/CMakeFiles/astcenc-avx2-static.dir/build.make:77: Source/CMakeFiles/astcenc-avx2-static.dir/astcenc_averages_and_directions.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:860: Source/CMakeFiles/astcenc-avx2-static.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
emmake: error: 'make' failed (returned 2)
A working solution:
Manually adding the __m256i* overloads of the functions suggested by the error message to avxintrin.h resolves the error.
SIMDe uses void*, so it appears they don't have the issue.
https://github.com/simd-everywhere/simde/blob/2af3dce9b2481b6b32139b1022cdfc02a633c898/simde/x86/avx.h#L4361
Would you have time to send a PR?
Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.
Would you have time to send a PR?
I would like to, but I need to spend some time understanding why SIMDe chose void* and why __m{128,256}i_u works in GCC/Clang before I can submit a PR with a good solution.
Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.
Yes, astc-encoder has various build modes including sse2, sse4.1, avx2, neno, and "none" which use different SIMD instruction set. Except for "none", each mode requires some minor adjustments to the cmake or cpp files to be built successfully. I've managed to find the fastest mode by building all the modes and comparing the encoding speed of the same large image file, and they do have noticeable perfomance differences.
It's interesting that the SSE4.1 wasm build performs optimally across both x86-64 and ARM.
| Intel i7-13700 | Apple M3 Pro | |
|---|---|---|
| none (no intrinsics) | 7.49s | 3.57s |
| sse2 | 3.94s | 3.03s |
| sse4.1 | 3.51s | 2.53s |
| avx2 (without f16c) | 4.17s | 2.95s |
| neon | 4.25s | 2.86s |
Would you have time to send a PR?
I would like to, but I need to spend some time understanding why SIMDe chose
void*and why__m{128,256}i_uworks in GCC/Clang before I can submit a PR with a good solution.
@tlively ?
Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.
Yes,
astc-encoderhas various build modes including sse2, sse4.1, avx2, neno, and "none" which use different SIMD instruction set. Except for "none", each mode requires some minor adjustments to the cmake or cpp files to be built successfully. I've managed to find the fastest mode by building all the modes and comparing the encoding speed of the same large image file, and they do have noticeable perfomance differences.It's interesting that the SSE4.1 wasm build performs optimally across both x86-64 and ARM.
Intel i7-13700 Apple M3 Pro none (no intrinsics) 7.49s 3.57s sse2 3.94s 3.03s sse4.1 3.51s 2.53s avx2 (without f16c) 4.17s 2.95s neon 4.25s 2.86s
Are you talking about wasm performance or native build performance?
Are you talking about wasm performance or native build performance?
The wasm performance when running in Chrome 135. The table shows the avg time they took to encode a large image file.
@tlively are these number what we might expect? I suspect the reason is that sse4.1 is closest to wasm-simd and therefore has the smallest emulation overhead?
From #24095 , I'm curious whether a more ideal fix (i.e. more inline with other compilers) is to somehow unify the definitions of __m256i and __m256i_u. Other compilers (Xcode, GCC, Clang) do this, but in Emscripten it's a bit more complicated because the __m256i* types contain corresponding __m128i* types.
If we could somehow unify both definitions, then we could typedef both __m256i and __m256i_u to the same underlying type with only differing alignments. This obviates the need to define intrinsic overloads which might be cumbersome.
@sbc100 will something like this work?
typedef int64_t __m128i_u __attribute__((__vector_size__(16), __aligned__(1)));
typedef struct {
__m128i_u v0;
__m128i_u v1;
} __m256i_base;
typedef __m256i_base __m256i_u __attribute__((__aligned__(1)));
typedef __m256i_base __m256i __attribute__((__aligned__(32)));