Questions about opus_decode timing on Mac & iOS
I want to ask about the running time of opus_decode after compiling on the Mac and iOS platforms.
ARM NEON inference is already in the existing code (vec_neon.h). I measured the running time of opus_decode in opus_demo (built with --enable-deep-plc); the deep PLC path only runs when a packet is lost. On a Mac M3 Pro laptop the average is about 0.3 ms, while on an iOS device (iPhone 12 Pro Max) the average is 2.5 ms and the peak is 7 ms (only 1.5 ms without deep PLC). Both builds are in Release mode.
I'm already using FARGAN, and I added the -march=armv8.2-a+dotprod option.
Has anybody measured the runtime of the DNN network in an online setting? Here are my CMake configuration scripts; I didn't modify the CMakeLists.txt. For the Mac M3 Pro:
SRC_DIR="../opus-1.5.2_cmake"
BUILD_DIR="./build"
LIB_DIR="./output"
printf "=== start config arm64 ===\n"
printf "cur dir: ${PWD}\n"
rm -rf $BUILD_DIR
cmake ${SRC_DIR} -B ${BUILD_DIR} \
-DOPUS_DEEP_PLC=ON \
-DOPUS_BUILD_PROGRAMS=ON \
-DCMAKE_OSX_ARCHITECTURES="x86_64;arm64" \
-DCMAKE_OSX_DEPLOYMENT_TARGET="10.15"\
-DCMAKE_XCODE_ATTRIBUTE_ONLY_ACTIVE_ARCH=NO \
-DCMAKE_BUILD_TYPE=Release \
cmake --build ${BUILD_DIR} --target opus
For iOS:
SRC_DIR="./opus-1.5.2"
BUILD_DIR="./build"
LIB_DIR="./libs_ios_load"
printf "cur dir: ${PWD}\n"
rm -rf $BUILD_DIR
cmake ${SRC_DIR} -B ${BUILD_DIR} \
-DOPUS_DEEP_PLC=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_ARCHITECTURES="arm64" \
-DCMAKE_XCODE_ATTRIBUTE_ONLY_ACTIVE_ARCH=NO \
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod" \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod" \
cmake --build ${BUILD_DIR} --target opus
Is my compile configuration wrong? Why is my opus_decode time so high? And is the "-march=armv8.2-a+dotprod" option enabled by default in the CMakeLists.txt?
Hi @jmvalin :) I've checked the flags on the iOS device. I'm sure I was running the NEON/dotprod code rather than the scalar code; when running in Xcode I added some printf checks to confirm this. The results are as follows:
The average opus_decode time is about 2 ms (iPhone 12 Pro Max) and the peak is 4.5-5 ms. This is still long.
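The check was of roughly this kind (a minimal sketch assuming the standard Arm predefines __ARM_NEON and __ARM_FEATURE_DOTPROD, not my exact printf statements; it only reports what the compiler enabled for this translation unit and doesn't account for any run-time dispatch inside Opus):

```c
#include <stdio.h>

/* Minimal sketch: report which SIMD paths the compiler enabled, using the
 * standard Arm feature macros (assumption: these predefines are what gate
 * the NEON/dotprod paths of interest). Run-time dispatch inside Opus, if
 * any, is not covered by this check. */
static void print_simd_flags(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    printf("compiled with NEON\n");
#else
    printf("compiled WITHOUT NEON\n");
#endif
#if defined(__ARM_FEATURE_DOTPROD)
    printf("compiled with dotprod\n");
#else
    printf("compiled WITHOUT dotprod\n");
#endif
}

int main(void)
{
    print_simd_flags();
    return 0;
}
```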
I also tried decreasing PLC_BUF_SIZE from ((CONT_VECTORS+10)*FRAME_SIZE) to ((CONT_VECTORS+5)*FRAME_SIZE) to shorten the while loop in lpcnet_plc_conceal. It helps a little, but the duration hasn't decreased significantly.
So what can I do for vec_neon.h?
To be sure, you could try changing dnn/vec.h so that it never uses vec_neon.h. For example, change
`#elif (defined(__ARM_NEON__) || defined(__ARM_NEON)) && !defined(DISABLE_NEON)`
to
`#elif 0`
and see how slow things become.
OK~ I'll test it soon.
Hi @jmvalin :)
I've tested with vec_neon.h completely turned off (using vec.h only):
- vec.h only (vec_neon.h off): the average opus_decode time is about 10 ms (iPhone 12 Pro Max) and the peak is 15 ms
- vec_neon.h: the average is about 2.5 ms and the peak is 5 ms

So NEON off vs. on is roughly 3:1. Is that ratio expected?
3:1 seems a bit on the low side, but not extreme. Maybe you want to compare that with other platforms. Also, you can use the same technique to check how much difference dotprod makes.
Thanks @jmvalin :), I've measured the time on the Mac and iOS platforms. The results are as follows:
| Avg time of opus_decode | Mac M3 Pro | iOS (iPhone 12 Pro Max, via Xcode) |
|---|---|---|
| no NEON | 4.5 ms | 10 ms |
| NEON, no dotprod | 0.46 ms | 3.5 ms |
| NEON + dotprod | 0.31 ms | 2.5 ms |
It seems the gap between NEON on/off is much bigger on Mac than on iOS. If I need a lower time for "NEON + dotprod", what can I do in vec_neon.h?
My time-measurement code is here; I only measure the time when a packet is lost:
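(A minimal sketch of this kind of measurement, assuming clock_gettime(CLOCK_MONOTONIC) and a 48 kHz mono decoder; not my exact code. Passing data == NULL to opus_decode() is what marks the packet as lost, so that call is the PLC path being timed.)

```c
#include <stdio.h>
#include <time.h>
#include <opus.h>

/* Time a single opus_decode() call in milliseconds. Passing data == NULL
 * tells the decoder the packet was lost, so this times the PLC path. */
static double decode_ms(OpusDecoder *dec, const unsigned char *data,
                        opus_int32 len, opus_int16 *pcm, int frame_size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    opus_decode(dec, data, len, pcm, frame_size, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    int err = OPUS_OK;
    OpusDecoder *dec = opus_decoder_create(48000, 1, &err); /* 48 kHz mono */
    opus_int16 pcm[960];                                    /* 20 ms at 48 kHz */
    if (err != OPUS_OK) return 1;
    /* In the real test, normal packets are decoded as usual and only the
     * lost ones (data == NULL) are timed, e.g.: */
    printf("PLC decode: %.3f ms\n", decode_ms(dec, NULL, 0, pcm, 960));
    opus_decoder_destroy(dec);
    return 0;
}
```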
About the time of the first opus_decode call when a packet is lost
By the way, I noticed that the first opus_decode call after a loss takes about 5 times as long as the subsequent ones. For example, with 4 lost packets covering 80 ms, the opus_decode time for the first lost packet is nearly 5 times that of the subsequent ones. Is that too slow?
I think it might be interesting to do the timings at a lower level than opus_decoder_dred_decode() to see if all functions behave the same or just some of them.
Well, note that I didn't use DRED; the top-level function is opus_decode.
Following your suggestion, I've measured the lower-level function lpcnet_plc_conceal; the comparison is as follows:
For example, with 4 lost packets covering 80 ms, opus_decode takes about 20 ms in total for the round and lpcnet_plc_conceal accounts for about 10 ms of that. As before, the first lost packet takes noticeably longer than the subsequent ones, and for lpcnet_plc_conceal the gap is even bigger.
Why does this happen?
Then I measured the time inside lpcnet_plc_conceal specifically for the first packet loss. The most time-consuming part is the "while loop". Since PLC_BUF_SIZE is the critical value that affects PLC quality, there seems to be no easy way around this.
In conclusion, there seem to be two options:
- Reduce the single-inference time of the NN modules, which comes back to NEON. What can I do for NEON?
- Reduce the buffer (which will hurt PLC quality)
Can you measure one step more precisely? Is it compute_plc_pred() taking most of that time or some other function?
Sure~ Below are the average running times of each type of function. fargan_cont() or fargan_synthesize_int() takes the most time; their underlying implementation is essentially the same.
| Model | Function | Avg time per call |
|---|---|---|
| PitchDNN | lpcnet_compute_single_frame_features_float() | 0.025 ms |
| PLCModel | compute_plc_pred() | 0.009 ms |
| FARGAN | fargan_cont() or fargan_synthesize_int() | 0.055 ms |
However, the peak time is driven by the "while loop", which contains lpcnet_compute_single_frame_features_float() and compute_plc_pred(). The number of iterations can't easily be reduced because it is tied to PLC_BUF_SIZE, and a smaller buffer hurts PLC quality. I have no idea how to reduce it further. Any suggestions? @jmvalin
Hi~ @jmvalin
Could you please check my experiments above? Is there any way to speed up the most time-consuming functions, fargan_cont() and fargan_synthesize_int()?
I mean the latest timings you posted (e.g. FARGAN taking 0.055 ms) seem to describe good performance, no?
Yes, the table above shows good performance, with PLC_BUF_SIZE at its default value (#define PLC_BUF_SIZE ((CONT_VECTORS+10)*FRAME_SIZE)). If I change it to ((CONT_VECTORS+5)*FRAME_SIZE), the time decreases but PLC quality suffers. So I haven't found a way to further reduce the time while maintaining quality. By the way, in the FARGAN paper the existing model is compared to a "small FARGAN" version of about 500k weights. Maybe the small FARGAN model could reduce compute and time. Is that model available?
What did you have to do to get from the original numbers you gave (e.g. 2.5ms on iphone12) to the latest ones?
I think there is a misunderstanding here. The original and latest timings come from the same version of the code. The latest timings I posted (FARGAN 0.055 ms, PLCModel 0.009 ms, PitchDNN 0.025 ms) are the average time per single execution, i.e. the cost of each network function.
However, inside lpcnet_plc_conceal() the three networks are executed multiple times, and the accumulated time inside opus_decode() is what produces the 2.5 ms average, which is the cost of the complete opus_decode() call.
So what I need now is to address the FARGAN network, which has the longest single-run time, in order to reduce the overall time. That's why I'm asking about a small version of FARGAN.
It's only compute_plc_pred() that gets called multiple times (on the first loss only). FARGAN gets called exactly once for every 10 ms of concealed audio.
Yes, I saw that in the picture I posted earlier. The first loss accounts for the majority of the time, resulting in a high average latency (e.g. for 80 ms of lost data: 0.69 ms for the first 20 ms lost and about 0.14 ms for each of the rest, tested on the Mac M3 Pro). A single call to compute_plc_pred() is very short; it is the while loop, i.e. the number of calls driven by PLC_BUF_SIZE, that matters.
So there may be two ways to reduce the time, right?
- Reduce PLC_BUF_SIZE (at the cost of PLC quality) to decrease the time for the first loss
- Use a small version of FARGAN to decrease the time for the subsequent losses
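As a rough sanity check of the numbers above, here is a back-of-envelope sketch. Assumptions: the first-loss catch-up loop pays one feature call plus one compute_plc_pred() call per iteration, FARGAN runs once per 10 ms of concealed audio (as noted above), and the iteration count is treated as an unknown that scales with PLC_BUF_SIZE/FRAME_SIZE; the per-call times are the averages from my earlier table.

```c
#include <stdio.h>

int main(void)
{
    /* Average per-call times from the table above (ms). */
    const double t_features = 0.025; /* lpcnet_compute_single_frame_features_float() */
    const double t_pred     = 0.009; /* compute_plc_pred() */
    const double t_fargan   = 0.055; /* fargan_cont() / fargan_synthesize_int() */
    const int conceal_ms    = 20;    /* each lost packet covers 20 ms in my test */

    /* The catch-up iteration count is assumed to scale with
     * PLC_BUF_SIZE / FRAME_SIZE; try a few plausible values. */
    for (int iters = 5; iters <= 15; iters += 5) {
        double first_loss = iters * (t_features + t_pred)
                          + (conceal_ms / 10) * t_fargan;
        double later_loss = (conceal_ms / 10) * t_fargan;
        printf("iters=%2d: first lost packet ~%.2f ms, later ones ~%.2f ms\n",
               iters, first_loss, later_loss);
    }
    return 0;
}
```

With 10-15 iterations this lands in the same ballpark as the 0.69 ms / 0.14 ms I measured on the Mac, which is consistent with the first-loss catch-up loop dominating.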