astc-encoder

Dynamically enable CPU ISA features at runtime

Open solidpixel opened this issue 6 years ago • 8 comments

The current 2.0 prototype branch is statically enabling various levels of SSE support at build-time, and we'd like the option to use SSE4.2, POPCNT, and AVX2 features for critical data paths. We can assume that SSE2 is always available (mandatory for x86-64), but anything higher than that should be dynamically selected at runtime to avoid the need for per-target builds.
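As a rough illustration of the runtime selection described above, the sketch below probes the host CPU once and reports the highest usable level. The `isa_level` enum and both function names are hypothetical and not part of astc-encoder. It assumes GCC or Clang on x86, where the `__builtin_cpu_supports` builtin is available; an MSVC build would need a `__cpuid`-based probe instead, and any other target falls back to the SSE2 baseline here.

```cpp
// Hypothetical ISA levels, ordered from the baseline to the most capable.
enum class isa_level { sse2 = 0, sse42 = 1, avx2 = 2 };

// Return the highest level the host CPU reports. SSE2 is the assumed floor,
// matching the x86-64 baseline mentioned in the issue.
inline isa_level detect_isa_level()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // GCC/Clang builtins; __builtin_cpu_init is harmless if already run.
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
    {
        return isa_level::avx2;
    }
    if (__builtin_cpu_supports("sse4.2") && __builtin_cpu_supports("popcnt"))
    {
        return isa_level::sse42;
    }
#endif
    return isa_level::sse2;
}

// Human-readable name for logging which path was picked.
inline const char* isa_level_name(isa_level level)
{
    switch (level)
    {
    case isa_level::avx2:  return "avx2";
    case isa_level::sse42: return "sse4.2";
    default:               return "sse2";
    }
}
```

A caller would typically evaluate `detect_isa_level()` once at context creation and cache the result, rather than probing on every call.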

For testing purposes it must be possible to forcefully select a specific path; either at build time or runtime, so we can ensure good test coverage.
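One way the runtime half of this testing requirement could look is an override that caps the detected level. This is only a sketch: the `ASTCENC_FORCE_ISA` environment variable, the enum, and the function names are all invented for illustration and do not exist in the project. The override can only lower the level, never raise it above what the CPU reports, so a test run cannot crash on an unsupported instruction.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical ISA levels, ordered from the baseline to the most capable.
enum class isa_level { sse2 = 0, sse42 = 1, avx2 = 2 };

// Apply a requested override on top of the detected level. Unknown or
// missing requests leave the detected level unchanged.
inline isa_level apply_isa_override(isa_level detected, const char* requested)
{
    if (requested == nullptr)
    {
        return detected;
    }

    isa_level wanted;
    if (std::strcmp(requested, "sse2") == 0)
    {
        wanted = isa_level::sse2;
    }
    else if (std::strcmp(requested, "sse4.2") == 0)
    {
        wanted = isa_level::sse42;
    }
    else if (std::strcmp(requested, "avx2") == 0)
    {
        wanted = isa_level::avx2;
    }
    else
    {
        return detected;
    }

    // Never select a path the host CPU cannot execute.
    return (wanted < detected) ? wanted : detected;
}

// Read the override from a hypothetical environment variable.
inline isa_level isa_level_from_environment(isa_level detected)
{
    return apply_isa_override(detected, std::getenv("ASTCENC_FORCE_ISA"));
}
```

A build-time equivalent would be a preprocessor define that hard-codes the level and skips detection entirely.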

solidpixel avatar Jan 14 '20 12:01 solidpixel

GCC needs e.g. -mavx2 specified on the command line to allow AVX2 intrinsics, but it may then also emit AVX2 instructions elsewhere, which breaks our "run anywhere" baseline of using SSE2 everywhere except in specialized functions. To fix this we'll need to pull the specialized routines for each ISA out into a dedicated file that we can compile with e.g. -mavx2, and then use -msse2 for the baseline files.

MSVC can only support a single set of compiler flags for the whole project, but it doesn't complain about intrinsics which require a higher ISA feature set than the current project settings. For this we can simply compile for SSE2, and pick up the intrinsics in the separate files automatically.
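For comparison, GCC and Clang also offer a per-function alternative to the per-file flags proposed above: the `target` function attribute lets a single function use AVX2 intrinsics while the rest of the translation unit stays at the baseline. This is a different technique from the file split, shown only as a sketch; the summing functions are hypothetical stand-ins, not astc-encoder code.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_HAS_AVX2_PATH 1
#include <immintrin.h>
#endif

// Baseline path, compiled with whatever flags the file is built with.
static uint32_t sum_u32_baseline(const uint32_t* data, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++)
    {
        total += data[i];
    }
    return total;
}

#if defined(SKETCH_HAS_AVX2_PATH)
// Only this function is allowed to use AVX2 code generation.
__attribute__((target("avx2")))
static uint32_t sum_u32_avx2(const uint32_t* data, size_t count)
{
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= count; i += 8)
    {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        acc = _mm256_add_epi32(acc, v);
    }

    // Reduce the eight lanes, then handle the scalar tail.
    uint32_t lanes[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
    uint32_t total = 0;
    for (uint32_t lane : lanes)
    {
        total += lane;
    }
    for (; i < count; i++)
    {
        total += data[i];
    }
    return total;
}
#endif

// Dispatch at runtime so the AVX2 path only runs on CPUs that support it.
static uint32_t sum_u32(const uint32_t* data, size_t count)
{
#if defined(SKETCH_HAS_AVX2_PATH)
    if (__builtin_cpu_supports("avx2"))
    {
        return sum_u32_avx2(data, count);
    }
#endif
    return sum_u32_baseline(data, count);
}
```

Like the file split, this only raises the ISA level for the annotated functions; code outside them is still generated for the baseline.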

solidpixel avatar Jan 18 '20 21:01 solidpixel

We should do this for 2.0 launch, but it can happen after beta.

solidpixel avatar Mar 11 '20 23:03 solidpixel

A quick investigation compiling the core C++ for sse2, only using sse4.2 or avx2 for functions using intrinsics, seems to leave a significant chunk of performance on the table.

g++ (7.5) seems about 5% slower with the core code compiled with -msse2 compared to -mavx2, even if we use all of the other specialized functions. clang++ (6.0) is better but still 2% slower. Some of this could be refined with better dynamic dispatch - the prototype was rough - but that doesn't seem likely to explain a large gap. Given compression performance is critical for this project it seems like we'll need to explore other options here - I don't want to lose 5% performance for a slight convenience boost and I can't force all users to use Clang.

The most obvious way forward would be to compile multiple variants of the entire core codec, each using optimal settings for one architecture, and just select the whole codec library to use at runtime. This will cost code size, but the project isn't huge and we could allow build-time specialization to compile-out unneeded variants.
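The whole-codec selection proposed above might be structured roughly as follows. This is a sketch under assumptions: the `codec_backend` struct, the variant namespaces, and the stub entry point are all hypothetical. In a real build each variant would be the same codec source compiled with different flags; here trivial stand-ins take their place.

```cpp
#include <cstddef>

// One entry per compiled variant of the codec (all names hypothetical).
struct codec_backend
{
    const char* name;
    bool (*is_supported)();
    int (*compress_stub)(int);   // stand-in for the real codec entry points
};

// Stand-ins for the same source built with -mavx2, -msse4.2, and -msse2.
namespace variant_avx2  { static int compress_stub(int x) { return x * 2; } }
namespace variant_sse42 { static int compress_stub(int x) { return x * 2; } }
namespace variant_sse2  { static int compress_stub(int x) { return x * 2; } }

static bool cpu_has_avx2()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

static bool cpu_has_sse42()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse4.2") != 0;
#else
    return false;
#endif
}

// The baseline is assumed to be always available.
static bool cpu_has_baseline() { return true; }

// Ordered from most to least capable; build-time specialization could
// compile out entries that are not needed.
static const codec_backend backends[] = {
    { "avx2",   cpu_has_avx2,     variant_avx2::compress_stub  },
    { "sse4.2", cpu_has_sse42,    variant_sse42::compress_stub },
    { "sse2",   cpu_has_baseline, variant_sse2::compress_stub  },
};

// Return the first backend the host supports, or nullptr if none match.
static const codec_backend* select_backend(const codec_backend* table, size_t count)
{
    for (size_t i = 0; i < count; i++)
    {
        if (table[i].is_supported())
        {
            return &table[i];
        }
    }
    return nullptr;
}
```

Selecting once and then calling through the chosen table keeps the dispatch cost out of the inner loops, at the price of carrying one copy of the codec per variant.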

solidpixel avatar Apr 13 '20 21:04 solidpixel

I tried implementing this as distinct backend shared objects and still see a 5% performance loss. Bumping this to 2.1 so we get some time to investigate it properly, as the loss is large enough to hurt.

solidpixel avatar Jul 08 '20 20:07 solidpixel

In case some data is needed: in the Unity context we're fine with building several versions of the encoder separately. However, we do want to build it as a library instead of an executable, which in the presence of several ISA choices means we have to build a dll/dylib/so. That by itself is fine, but the function entry points need to be marked as dllexport (MSVC) or default visibility (GCC/Clang). With something like:

diff --git a/Source/astcenc.h b/Source/astcenc.h
--- a/Source/astcenc.h
+++ b/Source/astcenc.h
@@ -155,6 +155,16 @@
 #include <cstddef>
 #include <cstdint>
 
+#ifdef ASTCENC_DYNAMIC_LIBRARY
+#   if defined(_MSC_VER)
+#       define ASTCENC_PUBLIC extern "C" __declspec(dllexport)
+#   else
+#       define ASTCENC_PUBLIC extern "C" __attribute__ ((visibility ("default")))
+#   endif
+#else
+#   define ASTCENC_PUBLIC
+#endif
+
 /* ============================================================================
     Data declarations
 ============================================================================ */
@@ -495,7 +505,7 @@ struct astcenc_image {
  * @return ASTCENC_SUCCESS on success, or an error if the inputs are invalid
  * either individually, or in combination.
  */
-astcenc_error astcenc_config_init(
+ASTCENC_PUBLIC astcenc_error astcenc_config_init(
 	astcenc_profile profile,
 	unsigned int block_x,
 	unsigned int block_y,
@@ -525,7 +535,7 @@ astcenc_error astcenc_config_init(
  *
  * @return ASTCENC_SUCCESS on success, or an error if context creation failed.
  */
-astcenc_error astcenc_context_alloc(
+ASTCENC_PUBLIC astcenc_error astcenc_context_alloc(
 	const astcenc_config& config,
 	unsigned int thread_count,
 	astcenc_context** context);
@@ -548,7 +558,7 @@ astcenc_error astcenc_context_alloc(
  *
  * @return ASTCENC_SUCCESS on success, or an error if compression failed.
  */
-astcenc_error astcenc_compress_image(
+ASTCENC_PUBLIC astcenc_error astcenc_compress_image(
 	astcenc_context* context,
 	astcenc_image& image,
 	astcenc_swizzle swizzle,
@@ -568,7 +578,7 @@ astcenc_error astcenc_compress_image(
  *
  * @return ASTCENC_SUCCESS on success, or an error if reset failed.
  */
-astcenc_error astcenc_compress_reset(
+ASTCENC_PUBLIC astcenc_error astcenc_compress_reset(
 	astcenc_context* context);
 
 /**
@@ -582,7 +592,7 @@ astcenc_error astcenc_compress_reset(
  *
  * @return ASTCENC_SUCCESS on success, or an error if decompression failed.
  */
-astcenc_error astcenc_decompress_image(
+ASTCENC_PUBLIC astcenc_error astcenc_decompress_image(
 	astcenc_context* context,
 	const uint8_t* data,
 	size_t data_len,
@@ -594,7 +604,7 @@ astcenc_error astcenc_decompress_image(
  *
  * @param context   The codec context.
  */
-void astcenc_context_free(
+ASTCENC_PUBLIC void astcenc_context_free(
 	astcenc_context* context);
 
 /**
@@ -604,7 +614,7 @@ void astcenc_context_free(
  *
  * @return A human readable nul-terminated string.
  */
-const char* astcenc_get_error_string(
+ASTCENC_PUBLIC const char* astcenc_get_error_string(
 	astcenc_error status);
 
 #endif

aras-p avatar Oct 28 '20 11:10 aras-p

@aras-p Super, thanks.

I probably won't get around to looking at the DLL usage for the command line tool for 2.5, but certainly adding the ability for others to build it this way makes sense.

solidpixel avatar Mar 03 '21 09:03 solidpixel

Public API change to make it pure C, and add symbol visibility annotation support, merged in cea746f. Remaining work not in scope for 2.5, so dropping it off that milestone for now.

solidpixel avatar Mar 03 '21 13:03 solidpixel

I would be interested in runtime SIMD selection too.

malytomas avatar Dec 03 '21 13:12 malytomas