dbcast Seg Fault
I encounter a segfault when running the dbcast command. I'm using a main node to manage two worker nodes, and so far the crash has always occurred on the second worker; the first worker succeeds.
The command run from the main node is:
mpirun -np 128 --hostfile hostfile --mca btl_tcp_if_exclude virbr0,lo,ib0 dbcast /home/mpiuser/cloud/128-files.0.0 /home/mpiuser/dbcastfile.0.0
Every process backtrace shows the same pattern; it looks like some kind of crash related to shared memory. I can try recompiling with debug symbols, and I will post a follow-up as I gather more data.
Oct 5 16:14:50 localhost systemd-coredump[89949]: Process 89887 (dbcast) of user 1000 dumped core.
Stack trace of thread 89887:
#0  0x00007f0b1d841187 strmap_unset (libmfu.so.3.0.0)
#1  0x00007f0b1d841421 strmap_unsetf (libmfu.so.3.0.0)
#2  0x0000000000402b04 GCS_Shmem_free (dbcast)
#3  0x00000000004060c7 main (dbcast)
#4  0x00007f0b1c4d1493 __libc_start_main (libc.so.6)
#5  0x000000000040226e _start (dbcast)

Stack trace of thread 89910:
#0  0x00007f0b1c59fa41 __poll (libc.so.6)
#1  0x00007f0b1bd73015 poll_dispatch (libopen-pal.so.40)
#2  0x00007f0b1bd6a5d9 opal_libevent2022_event_base_loop (libopen-pal.so.40)
#3  0x00007f0b1bd26e4e progress_engine (libopen-pal.so.40)
#4  0x00007f0b1b2af14a start_thread (libpthread.so.0)
#5  0x00007f0b1c5aadc3 __clone (libc.so.6)

Stack trace of thread 89915:
#0  0x00007f0b1c5ab0f7 epoll_wait (libc.so.6)
#1  0x00007f0b1bd6653d epoll_dispatch (libopen-pal.so.40)
#2  0x00007f0b1bd6a5d9 opal_libevent2022_event_base_loop (libopen-pal.so.40)
#3  0x00007f0b13d646be progress_engine (mca_pmix_pmix3x.so)
#4  0x00007f0b1b2af14a start_thread (libpthread.so.0)
#5  0x00007f0b1c5aadc3 __clone (libc.so.6)
The crash occurs in strmap_unset when the node being removed has two children and replacement == node->left. That happens when the left child has no right subtree, so the right-most node of the left subtree is the left child itself.
In this case the call to strmap_node_extract_single sets node->left to NULL, and that NULL pointer is then dereferenced on line 778, which causes the crash.
I'm not sure of the best way to fix this, so I will leave that to the authors.
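To illustrate the case described above, here is a minimal sketch of two-child deletion in a plain, unbalanced binary search tree. It is not the libmfu strmap code: the `node` type and the `node_new`, `bst_insert`, and `bst_remove` helpers are hypothetical names made up for this example, and any rebalancing is omitted. The point is only that the in-order predecessor can be the left child itself, and that this case has to be handled without re-reading a left pointer that was just cleared.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type for illustration; not the libmfu strmap layout. */
typedef struct node {
    const char *key;
    struct node *left;
    struct node *right;
} node;

static node *node_new(const char *key) {
    node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->key = key;
    return n;
}

static node *bst_insert(node *root, node *n) {
    if (root == NULL) return n;
    if (strcmp(n->key, root->key) < 0) root->left = bst_insert(root->left, n);
    else root->right = bst_insert(root->right, n);
    return root;
}

/* Remove the node matching key and return the new root of this subtree. */
static node *bst_remove(node *root, const char *key) {
    if (root == NULL) return NULL;
    int cmp = strcmp(key, root->key);
    if (cmp < 0) { root->left = bst_remove(root->left, key); return root; }
    if (cmp > 0) { root->right = bst_remove(root->right, key); return root; }

    node *replacement;
    if (root->left == NULL) {
        replacement = root->right;
    } else if (root->right == NULL) {
        replacement = root->left;
    } else {
        /* Two children: walk to the right-most node of the left subtree. */
        node *parent = root;
        replacement = root->left;
        while (replacement->right != NULL) {
            parent = replacement;
            replacement = replacement->right;
        }
        if (parent != root) {
            /* Predecessor sits deeper: splice it out, then adopt root->left. */
            parent->right = replacement->left;
            replacement->left = root->left;
        }
        /* Otherwise replacement == root->left and it keeps its own left
         * subtree; root->left is not dereferenced again. */
        replacement->right = root->right;
    }
    free(root);
    return replacement;
}
```

For example, inserting the keys "m", "f", "t", "c" and then removing "m" exercises the replacement == node->left path, because "f" has no right child.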
Thanks for the report and the debugging work, @NateRoiger.
@NateRoiger, thanks again for the report and the great debugging work, which made the fix much easier. I was able to reproduce this segfault.
I think #501 should fix it, and I've optimistically merged that in. It fixes my reproducer.
Would you also please verify that this fixes things for you?
I no longer see the crash in strmap, but I now see a hang after the broadcast completes. I think that is a different issue, which I can open once I have more information.
The hang begins after the "Bcast complete" message and lasted until I killed dbcast on worker0.
$ mpirun -hostfile hostfile --mca btl_tcp_if_exclude virbr0,lo,ib0 dbcast /home/mpiuser/cloud/128-files.0.0 /home/mpiuser/dbcast.file
[2021-10-11T10:38:25] Creating destination directories for `/home/mpiuser/dbcast.file`
[2021-10-11T10:38:25] Broadcasting contents of `/home/mpiuser/cloud/128-files.0.0` to `/home/mpiuser/dbcast.file`
[2021-10-11T10:38:25] Progress: 100.0% 3.049585 MB/s 0.0 secs remaining
[2021-10-11T10:38:25] Bcast complete: size=1024, time=0.016042 secs, speed=0.060874 MB/sec
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 68558 on node worker0 exited on signal 15 (Terminated).
--------------------------------------------------------------------------