tonycurtis

Results 38 comments of tonycurtis

Here's the UCX_LOG_LEVEL=info output:

```
[1653170535.008386] [er02:593424:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170535.009223] [er04:601925:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170535.020528] [er01:595504:0] ucp_context.c:1855 UCX INFO...
```

Here's the UCX_LOG_LEVEL=debug output for 2 PEs:

```
$ UCX_TLS=rc oshrun -n 2 ./a.out
[1653171612.558177] [er-head:875750:0] debug.c:1146 UCX DEBUG using signal stack 0x7ffff7fbb000 size 141824
[1653171612.558821] [er-head:875751:0] debug.c:1146 UCX DEBUG...
```

I'm going to investigate more myself; something weird is going on.

Looks like a relaxed-ordering issue that I'm not handling correctly.

Looks like relaxed ordering is disabled on this cluster (the admin ran the query command listed here: https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1280442391/AMD+2nd+Gen+EPYC+CPU+Tuning+Guide+for+InfiniBand+HPC?preview=/1280442391/1280409615/image-20200206-071308.png#Relaxed-Ordering). However, UCX sees the AMD processor and enables relaxed ordering anyway. Override with `UCX_IB_PCI_RELAXED_ORDERING=no`.
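For anyone hitting the same thing, here is a minimal sketch of applying that override. It assumes the launcher forwards the exported environment to every PE, which may not hold for every launcher. The `oshrun -n 2 ./a.out` line is simply the launch from the debug run above, and the `ucx_info -c` check is optional.

```shell
# Turn relaxed ordering off explicitly instead of relying on UCX's CPU-based choice.
export UCX_IB_PCI_RELAXED_ORDERING=no

# Optional sanity check: ucx_info -c prints the effective UCX configuration.
# Skipped when ucx_info is not on PATH.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -c | grep RELAXED_ORDERING
fi

# Re-run the same 2-PE test with the override in the environment.
# Assumption: oshrun passes exported variables through to the PEs.
if command -v oshrun >/dev/null 2>&1 && [ -x ./a.out ]; then
    oshrun -n 2 ./a.out
fi
```

Setting it inline for a single run, in the same style as `UCX_TLS=rc` above, should work equally well.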

> On Aug 16, 2022, at 5:53 PM, dmitrygx ***@***.***> wrote:
>
> > maybe "inter" or "ipc" ?
>
> we already have rkey (stands for remote), which is...

> On Sep 4, 2022, at 11:29 AM, Yossi Itigin ***@***.***> wrote:
>
> @yosefe commented on this pull request.
>
> In src/ucp/api/ucp.h :
>
> > */...

Yeah, that sounds good.

On Wed, Sep 21, 2022 at 7:53 PM Pavel Shamis (Pasha) <***@***.***> wrote:

> ***@***.**** commented on this pull request.
> ------------------------------
>
> In...

Seems related to #8216?

Just rebuilt with pmix, libevent, and hwloc all set to `=internal`, and everything works now.
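Roughly what that rebuild looks like, as a sketch only. It assumes an Open MPI-style `configure` script, where `--with-<pkg>=internal` selects the bundled copy of each dependency; the comment itself does not name the package being built. `CONFIGURE_ARGS` is just an illustrative variable name.

```shell
# Assumption: Open MPI-style configure flags; "internal" means "use the bundled copy".
CONFIGURE_ARGS="--with-pmix=internal --with-libevent=internal --with-hwloc=internal"

# Only run configure when we are actually inside a source tree that has one;
# otherwise just show the command that would be used.
if [ -x ./configure ]; then
    # Unquoted on purpose so the three flags are passed as separate arguments.
    ./configure $CONFIGURE_ARGS
else
    echo "would run: ./configure $CONFIGURE_ARGS"
fi
```

Using the bundled copies sidesteps mismatches with whatever versions of these libraries happen to be installed system-wide.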