Opus LACE / NoLACE and DRED on fixed-point implementations?

Hi,

The new ML algorithms in v1.5 are really impressive. It looks like they're only available in floating-point builds of Opus.

I'm compiling for Xtensa LX6 (ESP32), which doesn't have a hardware FPU, so I need the fixed-point implementation to get any real-time audio encoding / decoding.

I haven't really dug into the code, but my guess is the networks are represented in and presented with floating point values.
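To make the fixed-point vs floating-point distinction concrete, here is a minimal sketch of Q15 arithmetic, where a value in [-1, 1) is stored as a 16-bit integer and multiplied using integer instructions only. The names `q15_t`, `q15_from_float` and `q15_mul` are made up for this illustration; they are not Opus's own arithmetic macros, and nothing here reflects how the DNN code actually stores its weights.

```c
#include <stdint.h>

/* Q15: a value x in [-1, 1) is stored as the integer round(x * 32768).
   All names in this block are hypothetical, for illustration only. */
typedef int16_t q15_t;

/* Convert a float to Q15 with rounding and saturation.
   Intended for setup time, not for the real-time path. */
q15_t q15_from_float(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    return (q15_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Multiply two Q15 values using integer operations only.
   Assumes >> on a negative int32_t is an arithmetic shift, which is
   implementation-defined in C but true for common compilers. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;     /* Q30 intermediate */
    product += 1 << 14;                            /* round to nearest */
    product >>= 15;                                /* back to Q15 */
    if (product > INT16_MAX) product = INT16_MAX;  /* only -1 * -1 overflows */
    return (q15_t)product;
}
```

For example, `q15_mul(q15_from_float(0.5f), q15_from_float(0.5f))` yields 8192, which is 0.25 in Q15. A network whose weights and activations are plain `float` would need every layer rewritten along these lines, with scaling chosen per layer.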

make clean && ./configure \
    CC=/Users/kevin/.espressif/tools/xtensa-esp32-elf/esp-2021r2-patch3-8.4.0/xtensa-esp32-elf/bin/xtensa-esp32-elf-gcc \
    --host=xtensa --disable-extra-programs --enable-osce --disable-hardening \
    --disable-doc --enable-asm --enable-fixed-point \
  && make CC=/Users/kevin/.espressif/tools/xtensa-esp32-elf/esp-2021r2-patch3-8.4.0/xtensa-esp32-elf/bin/xtensa-esp32-elf-gcc

configure:
------------------------------------------------------------------------
  opus 1.5.1-dirty:  Automatic configuration OK.

    Compiler support:

      C99 var arrays: ................ yes
      C99 lrintf: .................... yes
      Use alloca: .................... no (using var arrays)

    General configuration:

      Floating point support: ........ no
      Fast float approximations: ..... no
      Fixed point debugging: ......... no
      Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
      External Assembly Optimizations: 
      Intrinsics Optimizations: ...... no
      Run-time CPU detection: ........ no
      Custom modes: .................. no
      Assertion checking: ............ no
      Hardening: ..................... no
      Fuzzing: ....................... no
      Check ASM: ..................... no

      API documentation: ............. no
      Extra programs: ................ no
------------------------------------------------------------------------

 Type "make; make install" to compile and install
 Type "make check" to run the test suite

/Applications/Xcode.app/Contents/Developer/usr/bin/make  all-recursive
  CC       celt/bands.lo
  CC       celt/celt.lo
  CC       celt/celt_encoder.lo
  CC       celt/celt_decoder.lo
In file included from /Users/kevin/.espressif/tools/xtensa-esp32-elf/esp-2021r2-patch3-8.4.0/xtensa-esp32-elf/xtensa-esp32-elf/sys-include/string.h:180,
                 from celt/os_support.h:41,
                 from celt/celt_decoder.c:37:
celt/celt_decoder.c: In function 'celt_decode_lost':
celt/os_support.h:79:83: error: invalid operands to binary - (have 'float *' and 'celt_sig *' {aka 'int *'})
 #define OPUS_COPY(dst, src, n) (memcpy((dst), (src), (n)*sizeof(*(dst)) + 0*((dst)-(src)) ))
                                                                              ~~~~~^~~~~~
celt/celt_decoder.c:914:13: note: in expansion of macro 'OPUS_COPY'
             OPUS_COPY(buf_copy+c*overlap, &decode_mem[c][DECODE_BUFFER_SIZE-N], overlap);
             ^~~~~~~~~
celt/os_support.h:79:83: error: invalid operands to binary - (have 'float *' and 'celt_sig *' {aka 'int *'})
 #define OPUS_COPY(dst, src, n) (memcpy((dst), (src), (n)*sizeof(*(dst)) + 0*((dst)-(src)) ))
                                                                              ~~~~~^~~~~~
celt/celt_decoder.c:914:13: note: in expansion of macro 'OPUS_COPY'
             OPUS_COPY(buf_copy+c*overlap, &decode_mem[c][DECODE_BUFFER_SIZE-N], overlap);
             ^~~~~~~~~
make[2]: *** [celt/celt_decoder.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
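The error comes from the type check built into the OPUS_COPY macro shown in the log: the `0*((dst)-(src))` term contributes nothing to the byte count, but pointer subtraction only compiles when both pointers have compatible types. Here the destination is a `float *` and the source is a `celt_sig *` (an `int *` in the fixed-point build). The sketch below reproduces the idiom in isolation; `COPY_CHECKED` and `duplicate_front` are hypothetical names used only for this illustration.

```c
#include <string.h>

/* Same shape as the OPUS_COPY macro quoted in the error log. The
   0*((dst)-(src)) term is always zero, but it forces the compiler to
   reject calls where dst and src point to incompatible types. */
#define COPY_CHECKED(dst, src, n) \
    (memcpy((dst), (src), (n)*sizeof(*(dst)) + 0*((dst)-(src))))

/* Hypothetical helper: copies the first n ints of buf into the next n
   slots. buf must hold at least 2*n elements. Both pointers are int*,
   so the subtraction inside the macro is valid and this compiles. */
void duplicate_front(int *buf, int n)
{
    COPY_CHECKED(buf + n, buf, n);
}

/* By contrast, mixing element types is rejected at compile time:
 *
 *     float f[2];
 *     int   i[2] = {1, 2};
 *     COPY_CHECKED(f, i, 2);   // error: invalid operands to binary -
 *
 * which is the situation in celt_decode_lost() in the log above. */
```

So the check is doing its job here: it flags that a float buffer is being filled from fixed-point signal memory, which suggests that code path was written with the floating-point build in mind.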

Mar 05 '24 05:03 expresspotato

Correct. All new DNN-based features are floating-point only. The reasoning is that most of the chips that are powerful enough to run that DNN code will also have an FPU. So at least for now (things can change) there's no plan to implement those in fixed-point.

Mar 05 '24 21:03 jmvalin

Here's my vote for fixed-point support of all features. In my testing (mainly speech), the fixed-point build of 1.5 uses only about two-thirds of the CPU time of the floating-point build when encoding complexity is above 5. My application is a high-density server, so in moving to 1.5 I have to choose between a decrease in density to get PLC and LACE/NoLACE, or a significant increase in density if I use the 1.5 fixed-point build and lose the new features.

Mar 12 '24 20:03 bateyejoe

On most modern chips floating-point should actually be faster than fixed-point. Maybe there's some optimization that isn't getting enabled.

Mar 12 '24 20:03 jmvalin

Try enabling fast-math and float-approx, and if you run a server with known hardware from this decade, you can presume AVX2 and SSE4.1 in your Opus build.

Mar 12 '24 20:03 xnorpx

Where do I find the "fast-math" option? float-approx is enabled. I have MAY_HAVE_SSE4_1 and MAY_HAVE_AVX2 enabled, but only presume up to SSE2. Could the run-time dispatching account for such a big difference? I can set those to PRESUME and give it a try. Testing on a Core i9-13900, btw.

Mar 13 '24 01:03 bateyejoe

What build system are you using? Autotools, CMake or Meson?

Mar 13 '24 02:03 xnorpx

Using our own CMake-based system. I started off with the Linux build and generated a Makefile with configure. I used that to build our CMakeLists.txt with just the options we need. The only difference in options between the Windows and Linux builds was that Linux had VAR_ARRAY enabled and Windows had ALLOCA enabled instead.

I just finished rebuilding with PRESUME for SSE4.1 and AVX2 and re-ran the benchmarks, and now, to my surprise, the 1.5-fixed and 1.5-float results are much closer. Either my initial test run was flawed, or PRESUME makes a pretty large difference. I will try going back to MAY_HAVE for SSE4.1 and AVX2 and let you know if that was really the difference.

Mar 13 '24 02:03 bateyejoe

@bateyejoe if you have a custom build system then you are on your own :) You can look at the Opus CMake files to see how they enable the following options.

OPUS_FLOAT_APPROX: enable floating point approximations (ensure your platform supports IEEE 754 before enabling).
OPUS_FAST_MATH: enable fast math (unsupported and discouraged use, as code is not well tested with this build option).
OPUS_X86_PRESUME_SSE4_1: assume target CPU has SSE4.1 support (override runtime check).
OPUS_X86_PRESUME_AVX2: assume target CPU has AVX, FMA, and AVX2 support (override runtime check).

It's some defines and some compiler flags.

Mar 13 '24 02:03 xnorpx

It's possible you never actually enabled RTCD (run-time CPU detection), which would prevent the code from taking advantage of any of the MAY_HAVE options.
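A rough sketch of why that matters, assuming a simplified dispatch scheme: the `DEMO_*` switches, `select_dot`, and the two kernels below are hypothetical stand-ins for this illustration, not the actual Opus macros or functions. With a PRESUME-style switch the optimized kernel is chosen at compile time; with a MAY_HAVE-style switch it is only reachable if run-time detection is also compiled in; otherwise the build silently falls back to the portable C path.

```c
#include <stddef.h>

/* Hypothetical build switches for this sketch (not the real Opus macros):
   DEMO_PRESUME_SIMD  - the build may assume the SIMD extension exists
   DEMO_MAY_HAVE_SIMD - SIMD code is compiled in but must be checked at run time
   DEMO_HAVE_RTCD     - run-time CPU detection is compiled in */

typedef float (*dot_fn)(const float *a, const float *b, size_t n);

/* Portable fallback kernel. */
float dot_c(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
}

/* Stand-in for a SIMD kernel; a real one would use intrinsics. */
float dot_simd(const float *a, const float *b, size_t n)
{
    return dot_c(a, b, n);
}

/* cpu_has_simd would come from a CPUID query in a real build. */
dot_fn select_dot(int cpu_has_simd)
{
#if defined(DEMO_PRESUME_SIMD)
    (void)cpu_has_simd;
    return dot_simd;                         /* decided at compile time */
#elif defined(DEMO_MAY_HAVE_SIMD) && defined(DEMO_HAVE_RTCD)
    return cpu_has_simd ? dot_simd : dot_c;  /* decided at run time */
#else
    (void)cpu_has_simd;
    return dot_c;                            /* no RTCD: SIMD path never used */
#endif
}
```

Compiled with only `-DDEMO_MAY_HAVE_SIMD`, `select_dot(1)` still returns `dot_c`, which mirrors the situation described above: the optimized code is built but never called.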

Mar 13 '24 02:03 jmvalin

I think you're right. Switching back to MAY_HAVE-only still performs on par with the fixed version, so I obviously missed something in that first config.

In any case, I don't think fixed-point support is completely worthless on modern processors. As I understand it, on multithreaded cores integer and floating-point operations can execute simultaneously, so having workloads with both integer and floating-point math is beneficial. In our case, we already have quite a bit of floating-point math going on, which is one of the reasons we chose the Opus fixed-point build in the past.

Thanks for the assistance.

Mar 13 '24 13:03 bateyejoe