
problem using device memory with get_ppln rendezvous Protocol

Open • edgargabriel opened this issue 2 years ago • 1 comment

Describe the bug

As part of our internal test script for UCX, we also run through all possible rendezvous protocols (put_zcopy, get_zcopy, put_ppln, get_ppln, am, rkey_ptr). With the v1 protocols, all of them work and pass our tests. With the v2 protocols, we run into a problem when using the get_ppln protocol with device memory; all other protocol types work correctly.
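For readers who want to reproduce the sweep described above, here is a minimal sketch of how the environment for each run might be built. It assumes the rendezvous protocol is selected through the `UCX_RNDV_SCHEME` environment variable and that the v2 protocols are toggled through `UCX_PROTO_ENABLE`; the helper names `build_env` and `test_matrix` are hypothetical, and the actual internal test script is not shown in this issue.

```python
import os

# Rendezvous schemes listed in the report above.
RNDV_SCHEMES = ["put_zcopy", "get_zcopy", "put_ppln", "get_ppln", "am", "rkey_ptr"]


def build_env(scheme, proto_v2, base_env=None):
    """Return a copy of the environment with the UCX settings for one run.

    Assumption: UCX_RNDV_SCHEME selects the rendezvous protocol and
    UCX_PROTO_ENABLE=y/n switches between v2 and v1 protocol selection.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_RNDV_SCHEME"] = scheme
    env["UCX_PROTO_ENABLE"] = "y" if proto_v2 else "n"
    return env


def test_matrix(base_env=None):
    """Enumerate (scheme, proto_v2, env) for every scheme under v1 and v2."""
    return [
        (scheme, proto_v2, build_env(scheme, proto_v2, base_env))
        for proto_v2 in (False, True)
        for scheme in RNDV_SCHEMES
    ]
```

Each returned `env` can be passed to `subprocess.run(cmd, env=env)` together with whatever device-memory benchmark command is used locally; the failing combination reported here corresponds to the `get_ppln` scheme with v2 protocols enabled.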

The problem seems to be that a host memory address gets passed to the rocm_ipc component for data transfer, which leads to an error/abort. The memory passed in stems from the allocation of the ucp_rndv_frags, whose memory type is hard-coded to host memory in ucp_proto_rndv_mtype_request_init().
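The mismatch described above can be illustrated with a small toy model. This is not UCX code and not a proposed fix: every name in it (`MemType`, `Buffer`, `IpcLikeTransport`, `alloc_frag`, `fetch_fragment`, `FRAG_SIZE`) is hypothetical, and the only behavior it assumes is the one reported here, namely that the IPC transport aborts when handed a host-memory staging fragment.

```python
from dataclasses import dataclass
from enum import Enum

FRAG_SIZE = 512 * 1024  # arbitrary fragment size for the toy model


class MemType(Enum):
    HOST = "host"
    ROCM = "rocm"


@dataclass
class Buffer:
    mem_type: MemType
    length: int


class IpcLikeTransport:
    """Stand-in for a device IPC transport that only handles device memory."""

    supported = {MemType.ROCM}

    def get(self, local, remote):
        # Mirrors the reported failure: a host address reaches the transport.
        if local.mem_type not in self.supported:
            raise ValueError(
                f"local buffer is {local.mem_type.value} memory, "
                "but this transport only supports device memory"
            )


def alloc_frag(length, mem_type=MemType.HOST):
    # The default models the hard-coded host memory type of the fragments.
    return Buffer(mem_type, length)


def fetch_fragment(transport, remote, frag_mem_type=MemType.HOST):
    """Pipeline step: stage one fragment, then fetch remote data into it."""
    frag = alloc_frag(min(remote.length, FRAG_SIZE), frag_mem_type)
    transport.get(frag, remote)
    return frag
```

In this model, `fetch_fragment(IpcLikeTransport(), Buffer(MemType.ROCM, 1 << 20))` raises because the staging fragment defaults to host memory, while passing `frag_mem_type=MemType.ROCM` succeeds. Whether the real remedy is to allocate the fragments in device memory or to stop selecting rocm_ipc for host fragments is exactly the open question in this issue.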

I am not entirely sure what the proper solution for this issue would be, or whether there is a way to circumvent this problem (other than not using the get_ppln rendezvous protocol for this scenario).

Setup and versions

ucx master with v2 protocols. OS is probably irrelevant.

Additional information (depending on the issue)

ucx_get_ppln_ipc.txt

edgargabriel, Sep 15 '23 18:09

I wanted to check back on whether somebody has a suggestion on what the issue could be. The issue is still observed with UCX 1.16.0-RC1.

edgargabriel, Dec 28 '23 16:12