UCX 1.17.0RC1 gtest hang
Describe the bug
While testing UCX 1.17.0RC1 we hit a problem with UCX gtest, which seems to end up in an endless loop during initialization and hangs.
Setup and versions
- OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...) ubuntu 22.04, x86_64 AMD EPYC 7V12
- For RDMA/IB/RoCE related issues:
- Driver version:
$ sudo dpkg -s rdma-core
Package: rdma-core
Status: install ok installed
Priority: optional
Section: net
Installed-Size: 229
Maintainer: Ubuntu Developers <[email protected]>
Architecture: amd64
Version: 39.0-1
$ ibstat
CA 'rocep177s0'
CA type: MT4117
Number of ports: 1
Firmware version: 14.25.8102
Hardware version: 0
Node GUID: 0xf46b8c6c2f580001
System image GUID: 0xf46b8c6c2f580001
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xf66b8cfffe6c2f58
Link layer: Ethernet
Additional information (depending on the issue)
- Output of
ucx_info -dattached to ticket
@edgargabriel can specify which gtest are you running? (with full command line).
@evgeny-leksikov can check?
Any command line argument will generate the issue on this platform, include just doing ./gtest
gtest gets stuck uct_test_base::enum_resources() (resp. uct_md_query_tl_resources invoked from there) for the gga_mlx5 tl. For whatever reason the component->tl_list seems to be ill formed and the loop at https://github.com/openucx/ucx/blob/master/src/uct/base/uct_md.c#L93 never terminates. (I tried to count the number of entries in the tl_list by inserting a ucs_list_length() in the code, but ucs_list_length also doesn't return, so its really the list that has an issue)
What is confusing is that the same tl_list looks correct for gga_mlx5 when invoked from ucx_info, so the question is what does gtest do differently than ucx_info when initializing the structure
as an additional data point, running gtest through valgrind memchecker does not show any memory corruption
@edgargabriel could you pls share config.log?
@edgargabriel unfortunally I cannot reproduce the hang on 2 different servers.
Could you please check that you have the same state of the list under gdb (see // <- comments):
gdb --args ./gtest
...
b uct_tl_register // <- set break point
r
...
Breakpoint 1, uct_tl_register (component=0x7ffff7a30980 <uct_self_component>, tl=0x7ffff7a30de0 <uct_self_tl>) at ../../../src/uct/base/uct_iface.c:986
d // <- delete break point
b uct_tl_register if component == &uct_gga_component // <- set conditional break point only for gga TL
r // re-run to catch only gga related call
...
Breakpoint 2, uct_tl_register (component=0x7ffff713e920 <uct_gga_component>, tl=0x7ffff713ea60 <uct_gga_mlx5_tl>) at ../../../src/uct/base/uct_iface.c:986
986 ucs_list_add_tail(&ucs_config_global_list, &tl->config.list);
(gdb) n
987 ucs_list_add_tail(&component->tl_list, &tl->list);
(gdb) p component->tl_list // <- here the list is empty
$1 = {
prev = 0x7ffff713e9d8 <uct_gga_component+184>,
next = 0x7ffff713e9d8 <uct_gga_component+184>
}
(gdb) p component->tl_list.next // <- make sure that the list is empty :-)
$2 = (struct ucs_list_link *) 0x7ffff713e9d8 <uct_gga_component+184>
(gdb) n // <- step next to add TL to the list
988 }
(gdb) p component->tl_list
$3 = {
prev = 0x7ffff713eab8 <uct_gga_mlx5_tl+88>,
next = 0x7ffff713eab8 <uct_gga_mlx5_tl+88>
}
(gdb) p component->tl_list.next // <- the entry is added
$4 = (struct ucs_list_link *) 0x7ffff713eab8 <uct_gga_mlx5_tl+88>
(gdb) p component->tl_list.next.next // <- there is the only 1 entry in the list
$5 = (struct ucs_list_link *) 0x7ffff713e9d8 <uct_gga_component+184>
this is what I would expect, I have only one breakpoint hit and checked it on 2 different servers,
if you have the same state, at this point I think you can catch corruption by watch gdb command by component->tl_list.next.next address (https://sourceware.org/gdb/current/onlinedocs/gdb.html/Set-Watchpoints.html)
let me know if you need help with above, we could have a debug session
I think I have multiple hits at that breakpoint, that might be the difference
$ gdb --args ./gtest
Reading symbols from ./gtest...
(gdb) break uct_tl_register
Function "uct_tl_register" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (uct_tl_register) pending.
(gdb) run
Starting program: /home/egabriel/ucx-1.17.0/build/test/gtest/gtest
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, uct_tl_register (component=0x7ffff7f2c9c0 <uct_self_component>, tl=0x7ffff7f2ce20 <uct_self_tl>) at ../../../src/uct/base/uct_iface.c:986
986 ucs_list_add_tail(&ucs_config_global_list, &tl->config.list);
(gdb) d 1
(gdb) break uct_tl_register if component == &uct_gga_component
Breakpoint 2 at 0x7ffff7eeb4b4: file ../../../src/uct/base/uct_iface.c, line 986.
(gdb) c
Continuing.
Breakpoint 2, uct_tl_register (component=0x7ffff7bfd940 <uct_gga_component>, tl=0x7ffff7bfda80 <uct_gga_mlx5_tl>) at ../../../src/uct/base/uct_iface.c:986
986 ucs_list_add_tail(&ucs_config_global_list, &tl->config.list);
(gdb) p component->tl_list
$1 = {prev = 0x7ffff7bfd9f8 <uct_gga_component+184>, next = 0x7ffff7bfd9f8 <uct_gga_component+184>}
(gdb) p component->tl_list.next
$2 = (struct ucs_list_link *) 0x7ffff7bfd9f8 <uct_gga_component+184>
(gdb) n
987 ucs_list_add_tail(&component->tl_list, &tl->list);
(gdb)
988 }
(gdb) p component->tl_list
$3 = {prev = 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>, next = 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>}
(gdb) p component->tl_list.next
$4 = (struct ucs_list_link *) 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>
(gdb) p component->tl_list.next.next
$5 = (struct ucs_list_link *) 0x7ffff7bfd9f8 <uct_gga_component+184>
(gdb) c
Continuing.
[New Thread 0x7fffea5ff640 (LWP 506962)]
[New Thread 0x7fffe9dfe640 (LWP 506963)]
[Thread 0x7fffe9dfe640 (LWP 506963) exited]
Thread 1 "gtest" hit Breakpoint 2, uct_tl_register (component=0x7ffff7bfd940 <uct_gga_component>, tl=0x7ffff7bfda80 <uct_gga_mlx5_tl>) at ../../../src/uct/base/uct_iface.c:986
986 ucs_list_add_tail(&ucs_config_global_list, &tl->config.list);
(gdb) p component->tl_list
$7 = {prev = 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>, next = 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>}
(gdb) n
987 ucs_list_add_tail(&component->tl_list, &tl->list);
(gdb)
988 }
(gdb) p component->tl_list
$8 = {prev = 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>, next = 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>}
(gdb) p component->tl_list.next
$9 = (struct ucs_list_link *) 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>
(gdb) p component->tl_list.next.next
$10 = (struct ucs_list_link *) 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>
(gdb) p component->tl_list.next.next.next
$11 = (struct ucs_list_link *) 0x7ffff7bfdad8 <uct_gga_mlx5_tl+88>
I would be open for a joint debug session, lets sync up by email
In fact, uct_tl_register is also invoked twice for uct_ib_component for each tl (uct_dc_mlx5_tl, uct_rc_verbs_tl, uct_rc_mlx5_tl, uct_ud_verbs_tl, uct_ud_mlx5_tl), so its not just uct_gga_component that seems to be registered twice.
Note that uct_tl_register is NOT invoked twice for tcp, self, and the shared memory components, just ib and gga.
just ib and gga
this is expected for IB because it has multiple transports (rc, dc, etc), but gga component has only 1 transport - gga_mlx5, could you pls check behaviour on your second setup where gtest works?
I will check on the other platform, but just to clarify, you would expect uct_ib_component to be registered twice with uct_dc_mlx5_tl, twice with uct_rc_verbs_tl, etc?
ah.. sorry, I missed that, the difference ib and gga components from others is they both placed in libuct_ib.so and loaded from lib constructor, probably by some reason it was loaded twice?
I'm able to reproduce the hang in broken environment:
$ LD_PRELOAD=<install>/lib/ucx/libuct_ib.so LD_DEBUG=files ./gtest 2>&1 | grep libuct_ib.so
2243910: file=<install>/lib/ucx/libuct_ib.so [0]; needed by ./gtest [0]
2243910: file=<install>/lib/ucx/libuct_ib.so [0]; generating link map
2243910: calling init: <install>/lib/ucx/libuct_ib.so
2243910: file=<build>/src/ucs/.libs/ucx/libuct_ib.so.0 [0]; dynamically loaded by <build>/src/ucs/.libs/libucs.so.0 [0]
2243910: file=<build>/src/ucs/.libs/ucx/libuct_ib.so.0 [0]; generating link map
2243910: calling init: <build>/src/ucs/.libs/ucx/libuct_ib.so.0
2243910: opening file=<build>/src/ucs/.libs/ucx/libuct_ib.so.0 [0]; direct_opencount=1
2243910: <build>/src/ucs/.libs/ucx/libuct_ib.so.0: error: symbol lookup error: undefined symbol: ucs_module_global_init (fatal)
callstack:
│Thread 1 (Thread 0x7f6e35512800 (LWP 2243910)):
│#0 0x00007f6e32a971b4 in __strcmp_avx2 () from /lib64/libc.so.6
│#1 0x00007f6e34fffcf1 in uct_gga_mlx5_query_tl_devices (md=0x221ba70, tl_devices_p=0x7ffd9ae7de98, num_tl_devices_p=0x7ffd9ae7de94) at ../../../../src/uct/ib/rc/accel/gga_mlx5.c:217
│#2 0x00007f6e34a6367b in uct_md_query_tl_resources (md=0x221ba70, resources_p=0x7ffd9ae7e9b0, num_resources_p=0x7ffd9ae7e9ac) at ../../../src/uct/base/uct_md.c:94
│#3 0x0000000000753c87 in uct_test::enum_resources (tl_name="rc_mlx5") at ../../../test/gtest/uct/uct_test.cc:426
│#4 0x00000000005dafb4 in gtest_rc_mlx5uct_amo_add_xor_test_EvalGenerator_ () at ../../../test/gtest/uct/test_amo_add_xor.cc:76
│#5 0x00000000005e55f1 in testing::internal::ParameterizedTestSuiteInfo<uct_amo_add_xor_test>::RegisterTests (this=0x217c840) at ../../../test/gtest/common/googletest/internal/gtest-param-util.h:537
│#6 0x0000000000eee19e in testing::internal::ParameterizedTestSuiteRegistry::RegisterTests (this=0x2177f28) at ../../../../../test/gtest/common/googletest/internal/gtest-param-util.h:699
│#7 0x0000000000edeb64 in testing::internal::UnitTestImpl::RegisterParameterizedTests (this=0x2177e40) at ../../../../../test/gtest/common/googletest/gtest.cc:2651
│#8 0x0000000000eea265 in testing::internal::UnitTestImpl::PostFlagParsingInit (this=0x2177e40) at ../../../../../test/gtest/common/googletest/gtest.cc:5138
│#9 0x0000000000ef59f8 in testing::internal::InitGoogleTestImpl<char> (argc=0x7ffd9ae7eefc, argv=0x7ffd9ae7f1b8) at ../../../../../test/gtest/common/googletest/gtest.cc:6100
│#10 0x0000000000eec93b in testing::InitGoogleTest (argc=0x7ffd9ae7eefc, argv=0x7ffd9ae7f1b8) at ../../../../../test/gtest/common/googletest/gtest.cc:6118
│#11 0x0000000000591ec1 in main (argc=1, argv=0x7ffd9ae7f1b8) at ../../../test/gtest/common/main.cc:68
in this case libuct_ib.so was initialized twice from different paths <install> and <build>, in "normal" case I see only <build> path
@edgargabriel can you pls check correctness of so loading?
@edgargabriel I found the difference in our config.log:
on my system I have:
$ cat ./config.log | grep runpath
lt_cv_shlibpath_overrides_runpath=no
in your config.log I see:
lt_cv_shlibpath_overrides_runpath=yes
So, on your system $LD_LIBRARY_PATH=<ucx_install_dir> takes unexpected effect and loads libuc<s|m|t|p>.so from there and additionally libuct_ib.so from <build> directory since gtest has it in direct dependencies.
As a workaround you can add LDFLAGS=-Wl,--disable-new-dtags to configure command to force using rpath and disable runpath
@yosefe I think we should disable runpath for gtest by default. From my understanding <rpath|runpath> defaults may depend on many factors like versions of libtool, gcc, OS, etc. Since we cannot control everything, gtest should always use .so's which it's compiled with because we use many internals there for testing and avoid double loading of internal modules like libuct_ib.so
@evgeny-leksikov yes, gtest rpath should point to build dir and not install dir, to force loading of build libs only. Currently, I see it points to both. If we don't use rpath at all, it will load system libs, which is not what we want.