ROCT-Thunk-Interface libhsakmt: get correct memory info for IPC shared memory

add reigstered shared memory info. otherwise it fails to deattach the IPC shared memory.

May 25 '20 06:05 zjunweihit

Hi Junwei, can you provide more context about who fails to detach IPC memory? Is that an interaction with the ROCr runtime? I don't think there is a dependency inside libhsakmt itself on fmm_get_mem_info for releasing IPC memory.

I'll ask a ROCr engineer to take a look at this pull request as well.

May 26 '20 05:05 fxkamd

Hi Junwei, can you provide more context about who fails to detach IPC memory? Is that an interaction with the ROCr runtime? I don't think there is a dependency inside libhsakmt itself on fmm_get_mem_info for releasing IPC memory.

I'll ask a ROCr engineer to take a look at this pull request as well.

Hi Felix,

Thanks for your reply. Yes, after I do IPC import, deattach the memory, it has an error in the path, since it misses the case of registered shared memory. The error comes from the returned NULL ptr, HSA IPCDetach() is going to unmap a null ptr. Anyway, the memory will be freed when hsa shuts down. But better to free it in correct way, I think.

Regards, Jerry

May 26 '20 08:05 zjunweihit

Hi Junwei,

Please add a Signed-off-by line to your commit. Then I will merge it into our internal branch after internal code review and testing.

May 26 '20 14:05 fxkamd

Hi Junwei,

Please add a Signed-off-by line to your commit.

Done.

Then I will merge it into our internal branch after internal code review and testing.

Thanks

Regards

May 27 '20 00:05 zjunweihit

I don't think this is quite right. It will often appear to function but there will be subtle bugs that could lead to data corruption. There are several patches to ROCr's IPC handling currently in internal code review that should correct the fault.

We may need something similar if hsakmt is to support CPU access to imported memory.

May 28 '20 05:05 skeelyamd

Sorry for my late response.

After import the IPC memory, and use it, we will do hsa_amd_ipc_memory_detach() to release the memory. during the process, from below code, we know that ptr is got from PtrInfo(), then pass to hsaKmtUnmapMemoryToGPU() for unmapping.
while in fmm_get_mem_info() called by PtrInfo(), it doesn't handle SHARED ptr(aka IPC memory). So the ptr is 0, the memory is not unmapped actually for now.

https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/core/runtime/runtime.cpp#L1000

      PtrInfoBlockData block;
      hsa_amd_pointer_info_t info;
      info.size = sizeof(info);
      if (PtrInfo(ptr, &info, nullptr, nullptr, nullptr, &block) != HSA_STATUS_SUCCESS)
        return HSA_STATUS_ERROR_INVALID_ARGUMENT;
      ptr = block.base;
    }
  }
  if (hsaKmtUnmapMemoryToGPU(ptr) != HSAKMT_STATUS_SUCCESS)
    return HSA_STATUS_ERROR_INVALID_ARGUMENT;

Regards, Junwei zhang

Jun 03 '20 06:06 zjunweihit

Yes you are correct about the defect. However, the proposed patch leads to other difficulties with regard to both the thunk interface definitions and the handling of small memory allocations.

There is a patch series that fixes this issue but unfortunately just missed the 3.5 branching date. It will be in the 3.6 release. I'll see if we can post it early.

Jun 03 '20 06:06 skeelyamd

For the issue, it could be fixed in hsa, but in the view of thunk, I think it misses a kind of memory type handling. IMHO, for a lib or sw block, it's better to handle any kind of request from the user(up caller), it cannot assume the actions of callers.

Regards, Junwei Zhang

Jun 03 '20 12:06 zjunweihit

Sorry for the lateness closing this off, but it's in ROCm 3.6+

Feb 24 '23 16:02 kentrussell