Fix #755: ARM benchmarks run without crash or swap
Fixes #755
- Drain pipeline queue before destruction (removes SIGABRT on aarch64).
- Lower ctl contention iterations to 200 k on aarch64.
All benchmarks pass on aarch64. Tested on Raspberry Pi 4 with 2GB as well as VM with 3GB – no swap growth.
:robot: Welcome! Thanks for your interest in contributing to the project!
Here is a short check-list to help you get started:
Creating pull request
- Target PR to
developbranch. - Include link to related issue in PR description.
- Ensure all CI checks pass.
Code review
- Mark PR as draft until it's ready. When ready, undraft and request review.
- Don't resolve discussions by yourself, instead leave a comment or thumbs up.
- Re-request review after addressing all discussions.
Refer to contribution guidelines for futher details.
Hi!
Thanks for looking into this, the approach with reducing benchmark parameters looks fine. So far I've tested your branch on 2 boxes, 32-bit RPi3B and 64-bit RPI4B, and I needed a few extra modifications:
diff --git a/src/tests/roc_core/bench_mpsc_queue.cpp b/src/tests/roc_core/bench_mpsc_queue.cpp
index 0e8859f5..70b07f72 100644
--- a/src/tests/roc_core/bench_mpsc_queue.cpp
+++ b/src/tests/roc_core/bench_mpsc_queue.cpp
@@ -16,7 +16,7 @@ namespace roc {
namespace core {
namespace {
-enum { BatchSize = 10000, NumIterations = 5000000, NumThreads = 16 };
+enum { BatchSize = 10000, NumIterations = 500000, NumThreads = 4 };
#if defined(ROC_BENCHMARK_USE_ACCESSORS)
inline int get_thread_index(const benchmark::State& state) {
@@ -150,8 +150,6 @@ BENCHMARK_REGISTER_F(BM_MpscQueue, TryPopFront)
->Arg(1)
->Arg(2)
->Arg(4)
- ->Arg(8)
- ->Arg(16)
->Iterations(NumIterations)
->Unit(benchmark::kMicrosecond);
@@ -184,8 +182,6 @@ BENCHMARK_REGISTER_F(BM_MpscQueue, PopFront)
->Arg(1)
->Arg(2)
->Arg(4)
- ->Arg(8)
- ->Arg(16)
->Iterations(NumIterations)
->Unit(benchmark::kMicrosecond);
diff --git a/src/tests/roc_ctl/bench_task_queue_contention.cpp b/src/tests/roc_ctl/bench_task_queue_contention.cpp
index f1f8452d..5438654f 100644
--- a/src/tests/roc_ctl/bench_task_queue_contention.cpp
+++ b/src/tests/roc_ctl/bench_task_queue_contention.cpp
@@ -17,9 +17,9 @@ namespace ctl {
namespace {
enum {
- NumScheduleIterations = 2000000,
- NumScheduleAfterIterations = 20000,
- NumThreads = 8,
+ NumScheduleIterations = 200000,
+ NumScheduleAfterIterations = 5000,
+ NumThreads = 4,
BatchSize = 1000
};
diff --git a/src/tests/roc_pipeline/bench_pipeline_loop_contention.cpp b/src/tests/roc_pipeline/bench_pipeline_loop_contention.cpp
index 57b7be0b..00b32c9f 100644
--- a/src/tests/roc_pipeline/bench_pipeline_loop_contention.cpp
+++ b/src/tests/roc_pipeline/bench_pipeline_loop_contention.cpp
@@ -31,7 +31,7 @@ namespace {
enum {
SampleRate = 1000000, // 1 sample = 1 us (for convenience)
Chans = 0x1,
- NumThreads = 16,
+ NumThreads = 4,
NumIterations = 1000000,
BatchSize = 10000,
FrameBufSize = 100
Can you please include those in your PR too?
Also, I suggest to make this solution a bit more generic.
First, arch-specific ifdefs (like __aarch64__) may lead to a mess in future if every benchmark would have ifdefs for different CPUs.
I suggest introducing 3 benchmark profiles: LARGE (for workstation-class computers and larger), MEDIUM (for powerfull SBCs like raspberry pi), and SMALL (for weak CPUs like e.g. MIPS routers).
Then in specific benchmarks we can do something like:
#ifdef ROC_BENCHMARK_PROFILE_LARGE
NumScheduleIterations = 2000000,
#else
NumScheduleIterations = 200000,
#endif
I've pushed a commit that defines ROC_CPU_FAMILY macro to cpu_traits.h (f5b9cc40ef0e6721d9e92c35a229a1c66bc2262a). We can add a header, say, src/tests/bench_profile.h, that sets ROC_BENCHMARK_PROFILE_xxx based on ROC_CPU_FAMILY.
Maybe something like this:
-
X86_64,PPC64,S390X- LARGE -
GENERIC,X86,PPC,S390,LOONGARCH64,AARCH64,MIPS64,RISCV64- MEDIUM - anything else - SMALL
Also I think we should allow overwriting selected profile via a build option, like:
scons ... --bench-profile=small
You can find examples in SConstruct, e.g. --sanitizers option. If option is given, we can add define like this: env.AppendUnique(CPPDEFINES='ROC_BENCHMARK_PROFILE_LARGE'). If option is not given, then bench_profile.h will deduce default profile from CPU.
In addition, currently number of threads in benchmarks is hard-coded, but obviously there is no good universal value. In practice we want ThreadRange(1, CpuCount()) instead of ThreadRange(1, 16).
We can implement Thread::cpu_count() in src/internal_modules/roc_core/target_posix/roc_core/thread.h and use it in benchmarks. I think it can use sched_getaffinity(), and if it fails or returns zero CPUs, it can fallback to sysconf(_SC_NPROCESSORS_ONLN) and then to _SC_NPROCESSORS_CONF if they're defined.
Also, some benchmarks use:
->Arg(1)
->Arg(2)
->Arg(4)
->Arg(8)
->Arg(16)
insteaf of ThreadsRange(1, 16), I guess we also need to rework them to use range.
Let me know what do you think about this ideas and how much time & interest do you have for this.
If you wish to implement the above, you're very welcome, otherwise let me know what parts are you interested in and I'll work on the others when I have time.
PS. cross-compiling benchmarks for aarch64 was failing for me, pushed a fix: c4ad8b8c7dde01b48ec0d67025f00956e3c6fb8a
Thanks! I'll update the PR as soon as I can. Don't want to give exact dates, but I'll aim within a week or two. I'll update with your changes and also add profiles. If time allows I can also add CpuCount(), will update you.
Hi there,
I implemented all changes except Thread::cpu_count(). I can do it for sure, and would love to, but does this have to be under same PR? Let me know.
However, I got some, hopefully, minor issues with passing checks.
This is what I did.
- Pushed new changes.
- github CI ran and reported an issue with formatting of new header file. All other checks passed.
- Fixed this with scons -Q fmt. Updated files were: bench_profile.h and bench_main.cpp(include header was rearranged).
- Did commit amend, no changes done to other files. git push --force-with-lease.
- This time, check have not passed.
Why could reformatting changes this? Or perhaps something else.
PS. I will also have a look later, but if you know the answer quicker, that would be great.
This far, in actual benchmarks, I split benchmark NumIterations, NumThreads and other parameters based on either LARGE profile, or anything else.
I only tested this on raspberry pi and my VM, which both should be considered MEDIUM profile. Do you think I should add different numbers for SMALL profile? If so, could you help me out with numbers there. I don't have any other devices to test this on.
Due to previously failed ci check(failed loopback_sink_2_source test), I ran it(ci_checks/linux-x86_64/ubuntu-24.04.sh) locally, and it passed. Then I rerun whole ci checks and now they are passing. So I would assume "not stable" test.
Otherwise, this PR is ready for review. Could someone take a look at it @gavv ?