Add OpenCL runtime support
- Abstract runtime functionality behind the HSA implementation (TODO: OCL)
- Use Requires to load OpenCL bindings
- Allow choosing the runtime via environment variables
Closes #20, closes #23
Whoooo!
So far I've gotten kernels to launch through OpenCL; however, they currently segfault on the GPU because (as I understand it) we aren't extracting the correct device-side pointer from cl.Buffer when we convert it to ROCDeviceArray (and then to FakeDeviceArray), so this shouldn't be expected to work right now. I'll probably need to figure out how to allocate buffers from OpenCL that mirror what we do with hsa_memory_alloc in finegrained mode; then we should be able to extract (somehow) a pointer that works from both host and device and pass that in.
Key note for reviewers: we (and LLVM) expect our array arguments to be of type ROCDeviceArray during compilation, so our kernels extract the pointer from that struct to get the actual buffer pointer. OpenCL apparently just passes pointers to raw buffers (as in C) instead of using nested structs, so we need to trick OpenCL into writing our ROCDeviceArray structs directly into the kernarg buffer. This part is working thanks to some code in OpenCL.jl which automatically handles isbits structs, so it's now on us to ensure that the right device-accessible pointer is embedded into the struct.
Note to self: if we do implement a hacky (slow) workaround for getting the device pointer, we should also provide a shortcut via clSVMAlloc, which supposedly does exactly what we do with HSARuntime. This of course requires OpenCL 2.0, but that's reasonable to expect if one wants the best performance.
Now I've got kernels running without segfaults (see the new test/opencl.jl test script), but it appears that the C array never gets written to. If anyone has an idea for why this is happening, I'm all ears!
> If anyone has an idea for why this is happening, I'm all ears!
Do you need to synchronize the memory?
It doesn't seem like that's the issue, since we wait on the kernel's event, and even adding a sync_workgroup() call to the kernel doesn't seem to do anything.
If anyone has a working ROCm debugger setup, it would be great if we could see what instructions the GPU is actually executing (including memory addresses). I suspect we aren't writing to the correct location.