satishskamath

Results 26 comments of satishskamath

Hi @yosefe, we have the same error, just from a different MPI call. The job is an intra-node job with the following execution statement: ``` mpirun -np 32 --mca pml ucx...

@yosefe : Yes. There are repetitions of these messages: ``` [root@hcn3 ~]# dmesg | grep mlx5 | cut -f 2 -d] | sort -u infiniband mlx5_0: create_mkey_callback:131:(pid 0): async reg...

@yosefe : Thank you for confirming that this is a driver issue. We will take it up with Nvidia Networking support.

@yosefe : I found this workaround online, but I guess it is not recommended. Can you comment? ``` MPI UCX ERROR: ivb_reg_mr If you are using the UCX layer...

@yosefe As mentioned above, we approached Nvidia with the same problem. Nvidia networking support's first reply was: ``` syndrome (0x18af6) means that the customer is trying to allocate more...

Lines with # are not working for me either.

> @satishskamath Are you up for cleaning up the reported code style issues?

Hi @boegel , I am up for making the changes but am currently on vacation until Sept...

@yosefe We upgraded `UCX` to `1.12.1` and `OpenMPI` to `4.1.4`, and the crash related to memory regions no longer occurs, even with the default `UCX_IB_REG_METHODS=rcache,odp,direct` and ``` UCX_IB_RCACHE_MEM_PRIO=1000...

@omor1 Does the issue go away if you set this environment variable to a large value, like you saw earlier? `UCX_IB_RCACHE_MAX_REGIONS=inf` I am testing again to check if there are...

@omor1 You were right. The issue still persists. The only difference is that UCX does not crash, but in the kernel ring buffer I can still see mlx-related errors...