Questions about opus_decode timing on Mac & iOS
I want to ask about the running time of opus_decode after compiling on the Mac and iOS platforms.
ARM NEON inference is already in the existing code (vec_neon.h). I measured the running time of opus_decode in opus_demo (built with --enable-deep-plc); the deep PLC path only runs when a packet is lost. On a Mac M3 Pro laptop the average is about 0.3 ms, while on an iOS device (iPhone 12 Pro Max) the average is 2.5 ms and the peak is 7 ms (only 1.5 ms without deep PLC). Both builds are in Release mode.
I'm already using FARGAN, and I added the -march=armv8.2-a+dotprod option.
Has anybody measured the runtime of the DNN network in an online setting? Here are my CMake configuration scripts; I didn't modify the CMakeLists.txt. For the Mac M3 Pro:
SRC_DIR="../opus-1.5.2_cmake"
BUILD_DIR="./build"
LIB_DIR="./output"
printf "=== start config arm64 ===\n"
printf "cur dir: ${PWD}\n"
rm -rf $BUILD_DIR
cmake ${SRC_DIR} -B ${BUILD_DIR} \
-DOPUS_DEEP_PLC=ON \
-DOPUS_BUILD_PROGRAMS=ON \
-DCMAKE_OSX_ARCHITECTURES="x86_64;arm64" \
-DCMAKE_OSX_DEPLOYMENT_TARGET="10.15"\
-DCMAKE_XCODE_ATTRIBUTE_ONLY_ACTIVE_ARCH=NO \
-DCMAKE_BUILD_TYPE=Release \
cmake --build ${BUILD_DIR} --target opus
For iOS:
SRC_DIR="./opus-1.5.2"
BUILD_DIR="./build"
LIB_DIR="./libs_ios_load"
printf "cur dir: ${PWD}\n"
rm -rf $BUILD_DIR
cmake ${SRC_DIR} -B ${BUILD_DIR} \
-DOPUS_DEEP_PLC=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_ARCHITECTURES="arm64" \
-DCMAKE_XCODE_ATTRIBUTE_ONLY_ACTIVE_ARCH=NO \
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod" \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod" \
cmake --build ${BUILD_DIR} --target opus
Is my compile configuration wrong? Why is my opus_decode time so high? And is the "-march=armv8.2-a+dotprod" option enabled by default in the CMakeLists.txt?
Hi @jmvalin :) I've checked the flags on the iOS device. I'm sure I was running the NEON/dotprod code rather than the scalar code; when running in Xcode I added some printf checks to confirm this. The results are as follows:
The average opus_decode time is about 2 ms (iPhone 12 Pro Max) and the peak is 4.5-5 ms. This is still long.
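The check was of roughly this kind (a minimal sketch assuming the standard Arm predefines __ARM_NEON and __ARM_FEATURE_DOTPROD, not my exact printf statements; it only reports what the compiler enabled for this translation unit and doesn't account for any run-time dispatch inside Opus):

```c
#include <stdio.h>

/* Minimal sketch: report which SIMD paths the compiler enabled, using the
 * standard Arm feature macros (assumption: these predefines are what gate
 * the NEON/dotprod paths of interest). Run-time dispatch inside Opus, if
 * any, is not covered by this check. */
static void print_simd_flags(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    printf("compiled with NEON\n");
#else
    printf("compiled WITHOUT NEON\n");
#endif
#if defined(__ARM_FEATURE_DOTPROD)
    printf("compiled with dotprod\n");
#else
    printf("compiled WITHOUT dotprod\n");
#endif
}

int main(void)
{
    print_simd_flags();
    return 0;
}
```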
I also tried decreasing PLC_BUF_SIZE from ((CONT_VECTORS+10)*FRAME_SIZE) to ((CONT_VECTORS+5)*FRAME_SIZE) to shorten the while loop in lpcnet_plc_conceal. It helps a little, but the duration hasn't decreased significantly.
So what can I do for vec_neon.h?
To be sure, you could try changing dnn/vec.h so that it never uses vec_neon.h. For example, change
`#elif (defined(__ARM_NEON__) || defined(__ARM_NEON)) && !defined(DISABLE_NEON)`
to
`#elif 0`
and see how slow things become.
OK~ I'll test it soon.
Hi @jmvalin :)
I've tested with vec_neon.h completely turned off (using vec.h only):
- vec.h only (vec_neon.h off): the average opus_decode time is about 10 ms (iPhone 12 Pro Max) and the peak is 15 ms
- vec_neon.h: the average is about 2.5 ms and the peak is 5 ms

So NEON off vs. on is roughly 3:1. Is that ratio expected?
3:1 seems a bit on the low side, but not extreme. Maybe you want to compare that with other platforms. Also, you can use the same technique to check how much difference dotprod makes.
Thanks @jmvalin :), I've measured the time on the Mac and iOS platforms. The results are as follows:
| Avg time of opus_decode | Mac M3 Pro | iOS (iPhone 12 Pro Max, via Xcode) |
|---|---|---|
| no NEON | 4.5 ms | 10 ms |
| NEON, no dotprod | 0.46 ms | 3.5 ms |
| NEON + dotprod | 0.31 ms | 2.5 ms |
It seems the gap between NEON on/off is much bigger on Mac than on iOS. If I need a lower time for "NEON + dotprod", what can I do in vec_neon.h?
My time-measurement code is here; I only measure the time when a packet is lost:
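(A minimal sketch of this kind of measurement, assuming clock_gettime(CLOCK_MONOTONIC) and a 48 kHz mono decoder; not my exact code. Passing data == NULL to opus_decode() is what marks the packet as lost, so that call is the PLC path being timed.)

```c
#include <stdio.h>
#include <time.h>
#include <opus.h>

/* Time a single opus_decode() call in milliseconds. Passing data == NULL
 * tells the decoder the packet was lost, so this times the PLC path. */
static double decode_ms(OpusDecoder *dec, const unsigned char *data,
                        opus_int32 len, opus_int16 *pcm, int frame_size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    opus_decode(dec, data, len, pcm, frame_size, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    int err = OPUS_OK;
    OpusDecoder *dec = opus_decoder_create(48000, 1, &err); /* 48 kHz mono */
    opus_int16 pcm[960];                                    /* 20 ms at 48 kHz */
    if (err != OPUS_OK) return 1;
    /* In the real test, normal packets are decoded as usual and only the
     * lost ones (data == NULL) are timed, e.g.: */
    printf("PLC decode: %.3f ms\n", decode_ms(dec, NULL, 0, pcm, 960));
    opus_decoder_destroy(dec);
    return 0;
}
```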
About the time of the first opus_decode call when a packet is lost
By the way, I noticed that the first opus_decode call after a loss takes about 5 times as long as the subsequent ones. For example, with 4 lost packets covering 80 ms, the opus_decode time for the first lost packet is nearly 5 times that of the subsequent ones. Is that too slow?
I think it might be interesting to do the timings at a lower level than opus_decoder_dred_decode() to see if all functions behave the same or just some of them.
Well, note that I didn't use DRED; the top-level function is opus_decode.
Following your suggestion, I've measured the lower-level function lpcnet_plc_conceal; the comparison is as follows:
For example, with 4 lost packets covering 80 ms, opus_decode takes about 20 ms in total for the round and lpcnet_plc_conceal accounts for about 10 ms of that. As before, the first lost packet takes noticeably longer than the subsequent ones, and for lpcnet_plc_conceal the gap is even bigger.
Why does this happen?
Then I measured the time inside lpcnet_plc_conceal specifically for the first packet loss. The most time-consuming part is the "while loop". Since PLC_BUF_SIZE is the critical value that affects PLC quality, there seems to be no easy way around this.
In conclusion, there seem to be two options:
- Reduce the single-inference time of the NN modules, which comes back to NEON. What can I do for NEON?
- Reduce the buffer (which will hurt PLC quality)
Can you measure one step more precisely? Is it compute_plc_pred() taking most of that time or some other function?
Sure~ Below are the average running times of each type of function. fargan_cont() or fargan_synthesize_int() takes the most time; their underlying implementation is essentially the same.
| Model | Function | Avg time per call |
|---|---|---|
| PitchDNN | lpcnet_compute_single_frame_features_float() | 0.025 ms |
| PLCModel | compute_plc_pred() | 0.009 ms |
| FARGAN | fargan_cont() or fargan_synthesize_int() | 0.055 ms |
However, the peak time is driven by the "while loop", which contains lpcnet_compute_single_frame_features_float() and compute_plc_pred(). The number of iterations can't easily be reduced because it is tied to PLC_BUF_SIZE, and a smaller buffer hurts PLC quality. I have no idea how to reduce it further. Any suggestions? @jmvalin
Hi~ @jmvalin
Could you please check my experiments above? Is there any way to speed up the most time-consuming functions, fargan_cont() and fargan_synthesize_int()?
I mean the latest timings you posted (e.g. FARGAN taking 0.055 ms) seem to describe good performance, no?
Yes, the table above shows good performance, with PLC_BUF_SIZE at its default value (#define PLC_BUF_SIZE ((CONT_VECTORS+10)*FRAME_SIZE)). If I change it to ((CONT_VECTORS+5)*FRAME_SIZE), the time decreases but PLC quality suffers. So I haven't found a way to further reduce the time while maintaining quality. By the way, in the FARGAN paper the existing model is compared to a "small FARGAN" version of about 500k weights. Maybe the small FARGAN model could reduce compute and time. Is that model available?
What did you have to do to get from the original numbers you gave (e.g. 2.5ms on iphone12) to the latest ones?
I think there is a misunderstanding here. The original and latest timings come from the same version of the code. The latest timings I posted (FARGAN 0.055 ms, PLCModel 0.009 ms, PitchDNN 0.025 ms) are the average time per single execution, i.e. the cost of each network function.
However, inside lpcnet_plc_conceal() the three networks are executed multiple times, and the accumulated time inside opus_decode() is what produces the 2.5 ms average, which is the cost of the complete opus_decode() call.
So what I need now is to address the FARGAN network, which has the longest single-run time, in order to reduce the overall time. That's why I'm asking about a small version of FARGAN.
It's only compute_plc_pred() that gets called multiple times (on the first loss only). FARGAN gets called exactly once for every 10 ms of concealed audio.
Yes, I saw that in the picture I posted earlier. The first loss accounts for the majority of the time, resulting in a high average latency (e.g. for 80 ms of lost data: 0.69 ms for the first 20 ms lost and about 0.14 ms for each of the rest, tested on the Mac M3 Pro). A single call to compute_plc_pred() is very short; it is the while loop, i.e. the number of calls driven by PLC_BUF_SIZE, that matters.
So there may be two ways to reduce the time, right?
- Reduce PLC_BUF_SIZE (at the cost of PLC quality) to decrease the time for the first loss
- Use a small version of FARGAN to decrease the time for the subsequent losses
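As a rough sanity check of the numbers above, here is a back-of-envelope sketch. Assumptions: the first-loss catch-up loop pays one feature call plus one compute_plc_pred() call per iteration, FARGAN runs once per 10 ms of concealed audio (as noted above), and the iteration count is treated as an unknown that scales with PLC_BUF_SIZE/FRAME_SIZE; the per-call times are the averages from my earlier table.

```c
#include <stdio.h>

int main(void)
{
    /* Average per-call times from the table above (ms). */
    const double t_features = 0.025; /* lpcnet_compute_single_frame_features_float() */
    const double t_pred     = 0.009; /* compute_plc_pred() */
    const double t_fargan   = 0.055; /* fargan_cont() / fargan_synthesize_int() */
    const int conceal_ms    = 20;    /* each lost packet covers 20 ms in my test */

    /* The catch-up iteration count is assumed to scale with
     * PLC_BUF_SIZE / FRAME_SIZE; try a few plausible values. */
    for (int iters = 5; iters <= 15; iters += 5) {
        double first_loss = iters * (t_features + t_pred)
                          + (conceal_ms / 10) * t_fargan;
        double later_loss = (conceal_ms / 10) * t_fargan;
        printf("iters=%2d: first lost packet ~%.2f ms, later ones ~%.2f ms\n",
               iters, first_loss, later_loss);
    }
    return 0;
}
```

With 10-15 iterations this lands in the same ballpark as the 0.69 ms / 0.14 ms I measured on the Mac, which is consistent with the first-loss catch-up loop dominating.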