Bus error

Open i-n-g-o opened this issue 4 years ago • 8 comments

Hello.

I ran into problems using flext when compiling externals (into a library) without FLEXT_USE_CMEM on a Bela (running Xenomai Linux). I did not dig deep, but it results in a Bus Error.

On a desktop linux (archlinux) it does not show this behaviour - the externals load and work fine.

Any ideas where this may come from?

i-n-g-o avatar Jan 02 '22 14:01 i-n-g-o

i was just going to report the same thing for Debian/armhf.

Debian/armhf targets processors like the RaspberryPi :disappointed: (although i tested on an OdroidXU4)

here's a backtrace:

(gdb) run
Starting program: /usr/bin/pd -nogui -nrt -nosound -nomidi -lib simple1
warning: Error disabling address space randomization: Success
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
[New Thread 0xb5ea43d0 (LWP 13648)]

Thread 1 "pd" received signal SIGBUS, Bus error.
0xb69a17a4 in lockfree::CAS2<lockfree::atomic_ptr<lockfree::stack_node>, lockfree::stack_node*, unsigned int> (new2=1, new1=<optimized out>, old2=0, old1=0x0, addr=0xb69c3c04 <ThrRegistry::pending>) at lockfree/cas.hpp:129
129	lockfree/cas.hpp: No such file or directory.
(gdb) bt
#0  0xb69a17a4 in lockfree::CAS2<lockfree::atomic_ptr<lockfree::stack_node>, lockfree::stack_node*, unsigned int> (new2=1, new1=<optimized out>, old2=0, old1=0x0, addr=0xb69c3c04 <ThrRegistry::pending>) at lockfree/cas.hpp:129
#1  lockfree::atomic_ptr<lockfree::stack_node>::CAS (newptr=<optimized out>, oldval=..., this=0xb69c3c04 <ThrRegistry::pending>) at lockfree/atomic_ptr.hpp:89
#2  lockfree::intrusive_stack<LifoCell>::push (node=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at lockfree/stack.hpp:76
#3  Lifo::Push (cell=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at flcontainers.h:29
#4  TypedLifo<thr_entry>::Push (c=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at flcontainers.h:39
#5  PooledLifo<thr_entry, 1, 10>::Push (c=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at flcontainers.h:78
#6  ThrFinder<PooledLifo<thr_entry, 1, 10> >::Push (e=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at flthr.cpp:90
#7  flext_shared::LaunchThread (meth=meth@entry=0xb69a8189 <flext_base_shared::QWorker(flext_shared::thr_params*)>, p=0x0) at flthr.cpp:278
#8  0xb69a8238 in flext_base_shared::StartQueue () at flqueue.cpp:533
#9  0xb699e73a in flext_base_shared::AddMessageMethods (c=0x602638, dsp=<optimized out>, dspin=<optimized out>) at flext.cpp:158
#10 0xb699e7b2 in flext_base_shared::Setup (id=0x6026a0) at flext.cpp:189
#11 0xb699f59e in flext_obj_shared::obj_add (lib=<optimized out>, dsp=<optimized out>, noi=<optimized out>, attr=<optimized out>, idname=0xb6b390f0 "simple1", names=0xb6b390f0 "simple1", 
    setupfun=0xb6b39025 <simple1::__setup__(flext_class*)>, newfun=0xb6b38e9d <simple1::__init__(int, _atom*)>, freefun=0xb6b38f8d <simple1::__free__(flext_hdr*)>, argtp1=0) at fllib.cpp:383
#12 0xb6b38f74 in simple1_setup () from ./simple1.pd_linux
#13 0x0053ed62 in ?? ()

"Bus Errors" on arm typically indicate unaligned memory access...

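As a rough illustration of that suspicion (a sketch, not flext code): the address in frame #0 above, 0xb69c3c04, is a multiple of 4 but not of 8, and as far as I know the double-word exclusive load/store instructions on 32-bit ARM expect 8-byte alignment. The `tagged_ptr` types and the `is_aligned` helper below are hypothetical names made up for this example; the real lockfree::atomic_ptr layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a pointer-plus-tag pair that is swapped with one
// double-width CAS. This is NOT the actual flext/lockfree definition.
// On a 32-bit target it is 8 bytes large but only needs 4-byte alignment.
struct tagged_ptr {
    void *ptr;
    unsigned int tag;
};

// Same payload, with the alignment explicitly raised to 8 bytes.
struct alignas(8) aligned_tagged_ptr {
    void *ptr;
    unsigned int tag;
};

// True if addr is a multiple of `alignment`, which must be a power of two.
inline bool is_aligned(const void *addr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(addr) & (alignment - 1)) == 0;
}
```

Called with the address from the backtrace, `is_aligned(addr, 8)` comes out false while `is_aligned(addr, 4)` is true, which would fit an 8-byte atomic operation hitting a 4-byte-aligned object.
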
umlaeute avatar Sep 05 '22 10:09 umlaeute

hmm, it seems this is related to my own build-system hacks.

at least, building both flext and externals with the "normal" flext-build system (flext/build pd gcc) appears to work fine.

umlaeute avatar Sep 05 '22 10:09 umlaeute

i stand corrected again.

building both flext and externals with the "normal" flext-build system will not result in a "Bus Error", but instead the external will hang Pd and consume 100% of a CPU (so I guess it just entered some endless loop).

so there is an issue with flext itself :-(

umlaeute avatar Sep 05 '22 11:09 umlaeute

Thank you, i will have a look. That might be the time to bring in boost::atomic instead of hard to maintain self-made code.

grrrr avatar Sep 06 '22 10:09 grrrr

Do you get the "../../source/lockfree/cas.hpp:217:9: warning: #warning blocking CAS2 emulation [-Wcpp]" warning on compilation? That is what i get on armhf with gcc (Raspbian 10.2.1-6+rpi1) 10.2.1 20210110, and it's definitely bad. On the other hand, i don't see problems on loading. The crash seems to originate from the __sync_bool_compare_and_swap_8 intrinsic though, which points to another source of the problem.

grrrr avatar Sep 08 '22 19:09 grrrr

the logs for building libflext can be accessed on https://buildd.debian.org/status/package.php?p=pd-flext and the logs for the test (which builds an external and links it with libflext) can be accessed on https://ci.debian.net/packages/p/pd-flext/

the actual test can be found on https://salsa.debian.org/multimedia-team/pd/pd-flext/-/tree/master/debian/tests (but it's really just compiling tutorial/3_attr1 and then running a simple test-patch on it).

umlaeute avatar Sep 09 '22 20:09 umlaeute

Hi, thank you, i will test the Debian source package myself. From the logs, i find it a little strange that the testbed kernel announces itself as arm64 while testing the armhf architecture.

grrrr avatar Sep 10 '22 10:09 grrrr

that's because the tests are obviously run on an arm64 CPU (which can execute armhf instructions, similar to an x86_64 CPU which can also run i386 binaries).

however, I also conducted tests on the OdroidXU4, which is a 32bit arm CPU, with the same results.

umlaeute avatar Sep 11 '22 19:09 umlaeute

funnily enough, it seems that the same test succeeds when run on a "Raspberry Pi 4" (using Raspbian/buster in armhf (32bit) mode)

umlaeute avatar Nov 16 '22 14:11 umlaeute

also i just noticed that the OP said:

without using FLEXT_USE_CMEM

(which probably implies that it does work when building with FLEXT_USE_CMEM).

i would like to stress, that i am building with -DFLEXT_USE_CMEM.

umlaeute avatar Nov 16 '22 14:11 umlaeute

and here's some output of valgrind:

==664373== Process terminating with default action of signal 7 (SIGBUS)
==664373==  Invalid address alignment at address 0x52F15BC
==664373==    at 0x52CFD5C: UnknownInlinedFun (cas.hpp:129)
==664373==    by 0x52CFD5C: UnknownInlinedFun (atomic_ptr.hpp:89)
==664373==    by 0x52CFD5C: UnknownInlinedFun (stack.hpp:76)
==664373==    by 0x52CFD5C: UnknownInlinedFun (flcontainers.h:29)
==664373==    by 0x52CFD5C: Push (flcontainers.h:39)
==664373==    by 0x52CFD5C: Push (flcontainers.h:78)
==664373==    by 0x52CFD5C: Push (flthr.cpp:90)
==664373==    by 0x52CFD5C: flext_shared::LaunchThread(void (*)(flext_shared::thr_params*), flext_shared::thr_params*) (flthr.cpp:278)
==664373==    by 0x52D70FF: flext_base_shared::StartQueue() (flqueue.cpp:533)
==664373==    by 0x52CC905: flext_base_shared::AddMessageMethods(_class*, bool, bool) (flext.cpp:158)
==664373==    by 0x52CC97D: flext_base_shared::Setup(flext_class*) (flext.cpp:189)
==664373==    by 0x52CD821: flext_obj_shared::obj_add(bool, bool, bool, bool, char const*, char const*, void (*)(flext_class*), flext_obj_shared* (*)(int, _atom*), void (*)(flext_hdr*), int, ...) (fllib.cpp:383)
==664373==    by 0x51113CB: attr1_setup (in /home/umlaeute/umlaeute-pd-flext/pd-flext-0.6.2/tutorial/3_attr1/attr1.pd_linux)
==664373==    by 0x17CE33: ??? (in /usr/bin/puredata)

umlaeute avatar Nov 16 '22 14:11 umlaeute

so with the feedback from https://github.com/grrrr/flext/issues/50#issuecomment-1241165357, my current workaround for this issue is to force the use of the blocking CAS/CAS2 emulation on the affected architectures (armhf & armel).

the patch can be found at the Debian pd-flext repository and basically adds another (set of) define(s), namely USE_BLOCKING_CAS and USE_BLOCKING_CAS2. Setting these at build time will just skip to the blocking implementations. I might have missed some ifdef'ed implementation (e.g. the _MSC_VER block is obviously not skipped if USE_BLOCKING_CAS is set, mostly because i only really care about Debian, where we don't use Microsoft compilers...). The point of the patch is that forcing the blocking behaviour is purely opt-in.
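A minimal sketch of what such an opt-in switch can look like, assuming a single 64-bit word stands in for the pointer-plus-counter pair. `cas2_sketch`, `cas2_guard` and `cas2_is_blocking` are hypothetical names for illustration only; the actual patch and cas.hpp may well be structured differently and should be consulted directly.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical helper, not the flext/lockfree implementation: compare-and-swap
// on a 64-bit word, with a blocking path that is either requested at build
// time (-DUSE_BLOCKING_CAS2) or used when no native 8-byte CAS is advertised.
#if defined(USE_BLOCKING_CAS2) || !defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8)

// One global mutex serialises every swap. Correct, but a thread may have to
// wait for another one, so this path is not realtime-safe.
inline std::mutex &cas2_guard() {
    static std::mutex m;
    return m;
}

inline bool cas2_sketch(std::uint64_t *addr, std::uint64_t expected, std::uint64_t desired) {
    assert(addr != nullptr);
    std::lock_guard<std::mutex> lock(cas2_guard());
    if (*addr != expected) return false;  // someone else changed the value
    *addr = desired;
    return true;
}

inline bool cas2_is_blocking() { return true; }

#else

// Native path: GCC/Clang builtin for an atomic compare-and-swap.
inline bool cas2_sketch(std::uint64_t *addr, std::uint64_t expected, std::uint64_t desired) {
    assert(addr != nullptr);
    return __sync_bool_compare_and_swap(addr, expected, desired);
}

inline bool cas2_is_blocking() { return false; }

#endif
```

Both branches expose the same function, so callers do not change; only the build flag decides which one is compiled in.
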

This is not optimal, or - to put it with @grrrr's words:

it's definitely bad

however, my reasoning is, that a "definitely bad", blocking, non-realtime safe workaround is much better than a crash at startup. also, i see little harm on the armel architecture, which lacks a hard-float unit so i don't think anybody is actually using such a device to run Pd (not to mention flext). Things are of course a bit different with armhf (which is the architecture of Raspberry Pi OS/32bit). But then:

  • with newer RPis, i guess/hope/urge that people will switch to 64bit (aarch64 aka arm64) where there seems to be no problem
  • with RPiOS, it seems to already fall back to the blocking default (i think the reason that you are seeing the fallback and I am seeing the crash is, that I am using an ordinary Debian installation, and you are using a Raspbian installation, which is known to have a lower baseline CPU (also targeting the RPi0 and RPi1), so __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 is probably not defined by your compiler...)
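
Whether a given toolchain advertises the 8-byte variant can be checked directly, e.g. by looking for the macro in the output of `gcc -dM -E - < /dev/null`, or from code. The helpers below are hypothetical names for a small sketch of the latter, not part of flext:

```cpp
#include <cassert>

// Reports whether this compiler/target combination advertises a native
// 8-byte __sync compare-and-swap via its predefined macros.
inline bool has_native_cas8() {
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8)
    return true;
#else
    return false;
#endif
}

// Same check for the 4-byte variant.
inline bool has_native_cas4() {
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
    return true;
#else
    return false;
#endif
}

// Sanity check for this sketch: a target that advertises the 8-byte variant
// is expected to advertise the 4-byte one as well.
inline void check_cas_macros() {
    assert(has_native_cas4() || !has_native_cas8());
}
```

Building this once with the Debian armhf compiler and once with the Raspbian one should show whether the two toolchains really differ on this macro.
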

umlaeute avatar Nov 22 '22 07:11 umlaeute

Thank you, IOhannes, for the clarification. That makes a lot of sense to me. I am sorry for being so slow in catching up with the issue. My plan would be to see if boost::atomic could help with this. I would like to outsource the explicit handling of architectures, also for the future.

grrrr avatar Nov 22 '22 07:11 grrrr