ROCclr

clEnqueueSVMMemcpy SegFault

Open richardmembarth opened this issue 5 years ago • 4 comments

Running clEnqueueSVMMemcpy(queue, CL_TRUE, dst, src, size, 0, NULL, NULL); with dst allocated with clSVMAlloc and src allocated by the system (e.g. posix_memalign) triggers a segmentation fault:

Thread 4 "Command Queue T" received signal SIGSEGV, Segmentation fault.

Backtrace:

#0  0x00007fffebed61b4 in amd::SharedReference<amd::Context>::operator() (this=0x68) at /space/rocm/ROCclr/platform/object.hpp:166
#1  0x00007fffebed1f1e in amd::Memory::getContext (this=0x0) at /space/rocm/ROCclr/platform/memory.hpp:302
#2  0x00007fffebfeb8e3 in roc::NullDevice::forceFineGrain (this=0x55555568f820, memory=0x0) at /space/rocm/ROCclr/device/rocm/rocdevice.hpp:194
#3  0x00007fffebfe04e0 in roc::VirtualGPU::submitSvmCopyMemory (this=0x7ffed8000b90, cmd=...) at /space/rocm/ROCclr/device/rocm/rocvirtual.cpp:1281
#4  0x00007fffebf00cd0 in amd::SvmCopyMemoryCommand::submit(device::VirtualDevice&) () from /space/rocm/ROCm-OpenCL-Runtime/build/lib/libamdocl64.so
#5  0x00007fffebf8f4e2 in amd::HostQueue::loop (this=0x5555555d1ce0, virtualDevice=0x7ffed8000b90) at /space/rocm/ROCclr/platform/commandqueue.cpp:167
#6  0x00007fffebf9251e in amd::HostQueue::Thread::run (this=0x5555555d1d88, data=0x5555555d1ce0) at /space/rocm/ROCclr/platform/commandqueue.hpp:161
#7  0x00007fffebf4a4bd in amd::Thread::main (this=0x5555555d1d88) at /space/rocm/ROCclr/thread/thread.cpp:93
#8  0x00007fffebf984a4 in amd::Thread::entry (thread=0x5555555d1d88) at /space/rocm/ROCclr/os/os_posix.cpp:318
#9  0x00007ffff6bc2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007ffff71cb103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The problem is the pair of checks on lines 1281 and 1282, both of which trigger a segfault if srcMem or dstMem is a nullptr, that is, if the pointer is not present in the memory map (lines 1259 and 1260):
https://github.com/ROCm-Developer-Tools/ROCclr/blob/roc-3.5.x/device/rocm/rocvirtual.cpp#L1281-L1282
https://github.com/ROCm-Developer-Tools/ROCclr/blob/roc-3.5.x/device/rocm/rocvirtual.cpp#L1259-L1260
In that case forceFineGrain is called with a nullptr instead of an amd::Memory*:
https://github.com/ROCm-Developer-Tools/ROCclr/blob/roc-3.5.x/device/rocm/rocdevice.hpp#L193
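The failure pattern described above can be sketched in isolation. The following is a minimal illustration only, not the ROCclr code: `Memory`, `MemObjMap`, `findMemObj`, `usesContext`, and `usesContextGuarded` are hypothetical stand-ins, and the exact-pointer map lookup is a simplifying assumption. It shows how a lookup that returns nullptr for plain host memory leads to a null dereference, and how a null check before the call would avoid it.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for a runtime memory object; NOT amd::Memory.
struct Memory {
    int context_id;
    int getContext() const { return context_id; }
};

// Assumed shape of the memory map: registered SVM pointers -> memory objects.
using MemObjMap = std::map<const void*, Memory*>;

// Returns the object registered for ptr, or nullptr if ptr is plain host
// memory (e.g. from posix_memalign) that was never registered.
Memory* findMemObj(const MemObjMap& map, const void* ptr) {
    auto it = map.find(ptr);
    return it == map.end() ? nullptr : it->second;
}

// Stand-in for a check that dereferences its argument unconditionally.
// Calling this with nullptr is undefined behavior (the reported segfault).
bool usesContext(const Memory* mem, int context_id) {
    return mem->getContext() == context_id;
}

// Guarded variant: an unregistered pointer is treated as "not in this
// context" instead of being dereferenced.
bool usesContextGuarded(const Memory* mem, int context_id) {
    return mem != nullptr && usesContext(mem, context_id);
}
```

With a map that contains only the SVM destination, `findMemObj` returns nullptr for the host source pointer; `usesContextGuarded` then returns false, whereas `usesContext` would dereference the null pointer.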

Tested on the latest ROCm 3.5.1 release with the corresponding roc-3.5.x or rocm-3.5.x branches.

richardmembarth avatar Jul 13 '20 12:07 richardmembarth

This still happens with the latest ROCm 3.9.0 release.

The same bug is triggered by test_svm in the Khronos OpenCL Conformance Tests:

./test_conformance/SVM/test_svm
...
Compute Device Name = gfx1010, Compute Device Vendor = Advanced Micro Devices, Inc., Compute Device Version = OpenCL 2.0 , CL C Version = OpenCL C 2.0
...
svm_enqueue_api...
clEnqueueSVMMemcpy case: src_alloc = host, dst_alloc = host
clEnqueueSVMMemcpy case: src_alloc = host, dst_alloc = svm
Segmentation fault (core dumped)

richardmembarth avatar Oct 29 '20 12:10 richardmembarth

clEnqueueSVMMemcpy also crashes with ROCm 5.1.3, both in OpenCL-CTS (in the same spot as outlined above) and in a standalone test program that only calls this function.

claudiubalogh avatar Jun 15 '22 14:06 claudiubalogh

I believe ROCm 5.2 should have a fix for this issue.

gandryey avatar Jun 15 '22 19:06 gandryey

Tested a W5700 with rocm-5.2.1 and a W6600 with rocm-5.2.3 on different machines: the segfault no longer occurs, but the OpenCL-CTS SVM test fails on both machines with:

     svm_enqueue_api...
     clEnqueueSVMMemcpy case: src_alloc = host, dst_alloc = host
     clEnqueueSVMMemcpy case: src_alloc = host, dst_alloc = svm
     Invalid data at index 0, dst_ptr 99, src_ptr 53
     svm_enqueue_api FAILED
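
The message above reports the first index at which the destination differs from the source after the copy. A check of that kind can be sketched as follows; this is an assumption about the general shape of such a comparison, not the OpenCL-CTS code, and `firstMismatch` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the first element where dst and src differ.
// If no element differs, returns the number of elements compared
// (the shorter of the two sizes).
std::size_t firstMismatch(const std::vector<unsigned char>& dst,
                          const std::vector<unsigned char>& src) {
    const std::size_t n = std::min(dst.size(), src.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (dst[i] != src[i]) {
            return i;  // copy did not produce matching data at index i
        }
    }
    return n;
}
```

For buffers that start with 99 and 53 respectively, as in the log, the helper returns index 0.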

claudiubalogh avatar Aug 29 '22 05:08 claudiubalogh