mpifileutils

dbcast Seg Fault

Open NateThornton opened this issue 4 years ago • 4 comments

Running the dbcast command results in a segfault. I'm using a main node to manage two worker nodes, and so far the crash has always occurred on the second worker; the first worker succeeds.

The command run from the main node is: mpirun -np 128 --hostfile hostfile --mca btl_tcp_if_exclude virbr0,lo,ib0 dbcast /home/mpiuser/cloud/128-files.0.0 /home/mpiuser/dbcastfile.0.0

Every process backtrace has the same pattern; looks like some kind of crash in shared memory. I can try to recompile with debug symbols. Will post a response as I gather more data.

Oct  5 16:14:50 localhost systemd-coredump[89949]: Process 89887 (dbcast) of user 1000 dumped core.

Stack trace of thread 89887:
#0  0x00007f0b1d841187 strmap_unset (libmfu.so.3.0.0)
#1  0x00007f0b1d841421 strmap_unsetf (libmfu.so.3.0.0)
#2  0x0000000000402b04 GCS_Shmem_free (dbcast)
#3  0x00000000004060c7 main (dbcast)
#4  0x00007f0b1c4d1493 __libc_start_main (libc.so.6)
#5  0x000000000040226e _start (dbcast)

Stack trace of thread 89910:
#0  0x00007f0b1c59fa41 __poll (libc.so.6)
#1  0x00007f0b1bd73015 poll_dispatch (libopen-pal.so.40)
#2  0x00007f0b1bd6a5d9 opal_libevent2022_event_base_loop (libopen-pal.so.40)
#3  0x00007f0b1bd26e4e progress_engine (libopen-pal.so.40)
#4  0x00007f0b1b2af14a start_thread (libpthread.so.0)
#5  0x00007f0b1c5aadc3 __clone (libc.so.6)

Stack trace of thread 89915:
#0  0x00007f0b1c5ab0f7 epoll_wait (libc.so.6)
#1  0x00007f0b1bd6653d epoll_dispatch (libopen-pal.so.40)
#2  0x00007f0b1bd6a5d9 opal_libevent2022_event_base_loop (libopen-pal.so.40)
#3  0x00007f0b13d646be progress_engine (mca_pmix_pmix3x.so)
#4  0x00007f0b1b2af14a start_thread (libpthread.so.0)
#5  0x00007f0b1c5aadc3 __clone (libc.so.6)

NateThornton avatar Oct 05 '21 21:10 NateThornton

The crash occurs in strmap_unset when the node to be removed has two children and replacement == node->left. This happens when node->left has no right child, so the right-most node of the left subtree is node->left itself.

In this case, the call to strmap_node_extract_single sets node->left to NULL, which causes a crash when it is dereferenced on line 778.

I'm not sure of the best way to fix this, so I will leave that to the authors.
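For illustration only, here is a minimal sketch of the case described above. It assumes a plain, unbalanced binary search tree with parent pointers; the type and function names (node, extract_single, remove_node, demo_left_child_replacement) are hypothetical and are not the actual strmap code or its fix. It shows why node->left has to be re-read and NULL-checked after the replacement is extracted: when the replacement is node->left itself, the extraction rewrites node->left.

```c
#include <stddef.h>

/* Hypothetical node type for illustration; not the real strmap node. */
typedef struct node {
    int key;
    struct node *left, *right, *parent;
} node;

/* Right-most node of a subtree (in-order predecessor candidate). */
static node *rightmost(node *n) {
    while (n->right != NULL) n = n->right;
    return n;
}

/* Detach a node that has at most one child, linking that child to the parent. */
static void extract_single(node **root, node *n) {
    node *child = (n->left != NULL) ? n->left : n->right;
    if (child != NULL) child->parent = n->parent;
    if (n->parent == NULL)         *root = child;
    else if (n->parent->left == n) n->parent->left = child;
    else                           n->parent->right = child;
    n->left = n->right = n->parent = NULL;
}

/* Remove n from the tree rooted at *root. */
static void remove_node(node **root, node *n) {
    if (n->left == NULL || n->right == NULL) {
        extract_single(root, n);
        return;
    }
    node *rep = rightmost(n->left);
    /* If rep == n->left, this call rewrites n->left (possibly to NULL). */
    extract_single(root, rep);

    /* Re-read n's children after the extraction and NULL-check each one
       before dereferencing it. */
    rep->left = n->left;
    if (rep->left != NULL) rep->left->parent = rep;
    rep->right = n->right;
    if (rep->right != NULL) rep->right->parent = rep;

    /* Put rep where n used to be. */
    rep->parent = n->parent;
    if (n->parent == NULL)         *root = rep;
    else if (n->parent->left == n) n->parent->left = rep;
    else                           n->parent->right = rep;
    n->left = n->right = n->parent = NULL;
}

/* Builds the tree 2 -> {1, 3}, removes 2, and returns 1 if the result
   is 1 -> {NULL, 3}. Here the replacement is exactly node->left. */
static int demo_left_child_replacement(void) {
    node a = {1, NULL, NULL, NULL};
    node b = {2, NULL, NULL, NULL};
    node c = {3, NULL, NULL, NULL};
    node *root = &b;
    b.left = &a;  a.parent = &b;
    b.right = &c; c.parent = &b;
    remove_node(&root, &b);
    return root == &a && a.parent == NULL && a.left == NULL
        && a.right == &c && c.parent == &a;
}
```

Calling demo_left_child_replacement() exercises the case where the removed node's left child has no right child. Without the NULL check on rep->left, that case would dereference a NULL pointer.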

NateThornton avatar Oct 06 '21 20:10 NateThornton

Thanks for the report and the debugging work, @NateRoiger .

adammoody avatar Oct 07 '21 14:10 adammoody

@NateRoiger , thanks again for the report and the great job in debugging things. That made the fix much easier. I was able to reproduce this segfault.

I think #501 should fix it, and I've optimistically merged that in. It fixes my reproducer.

Would you also please verify that this fixes things for you?

adammoody avatar Oct 07 '21 16:10 adammoody

I no longer experience the crash in strmap, but I am now seeing a hang after the broadcast completes. I think that is a different issue, which I can open separately once I have more information.

The hang began after "Bcast complete" was printed and lasted until I killed dbcast on my worker0.

$ mpirun -hostfile hostfile --mca btl_tcp_if_exclude virbr0,lo,ib0 dbcast /home/mpiuser/cloud/128-files.0.0 /home/mpiuser/dbcast.file
[2021-10-11T10:38:25] Creating destination directories for `/home/mpiuser/dbcast.file`
[2021-10-11T10:38:25] Broadcasting contents of `/home/mpiuser/cloud/128-files.0.0` to `/home/mpiuser/dbcast.file`
[2021-10-11T10:38:25] Progress: 100.0% 3.049585 MB/s 0.0 secs remaining
[2021-10-11T10:38:25] Bcast complete: size=1024, time=0.016042 secs, speed=0.060874 MB/sec
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 68558 on node worker0 exited on signal 15 (Terminated).
--------------------------------------------------------------------------

NateThornton avatar Oct 11 '21 14:10 NateThornton