Andreas Klöckner
I'd be happy to take a patch.
> It inits the driver, then spawns a child process

According to Nvidia, you may not `fork()` after initializing CUDA.
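A common workaround (a sketch, not something from this thread: the `worker` function and the init comment are illustrative) is to have the child start via the `spawn` method, so it gets a fresh interpreter and initializes CUDA itself instead of inheriting a forked driver state:

```python
# Sketch: avoid fork() after CUDA init by using the "spawn" start
# method, so the child process starts with a fresh interpreter and
# no inherited CUDA state. Names here (worker) are illustrative.
import multiprocessing as mp

def worker(q):
    # Initialize CUDA here, in the child, not before in the parent,
    # e.g. (hypothetically) via `import pycuda.autoinit`.
    q.put("child initialized its own CUDA context")

def main():
    ctx = mp.get_context("spawn")  # fresh process, no fork of CUDA state
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()

if __name__ == "__main__":
    main()
```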
> Depend on an instruction only if it writes to the variables that the compute instruction reads.

Is this correct? The write could occur anywhere in the transitive closure...
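To illustrate the concern (a sketch with made-up instruction names, not loopy's actual dependency machinery): if the instruction that writes a variable read by the compute instruction is only reachable through intermediate instructions, a check of direct dependencies misses it; the check has to walk the transitive closure of the dependency relation:

```python
# Sketch: the instruction writing a variable that "compute" reads may
# sit anywhere in the transitive closure of compute's dependencies.
# The instruction names and dep graph are made up for illustration.
deps = {                     # insn -> instructions it directly depends on
    "compute": {"b_upd"},
    "b_upd": {"a_write"},    # a_write writes the variable compute reads
    "a_write": set(),
}

def transitive_deps(insn, deps):
    """All instructions reachable from insn via the dependency relation."""
    seen = set()
    stack = [insn]
    while stack:
        for d in deps[stack.pop()]:
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

# a_write is not a direct dependency of compute, but it is reachable:
print(sorted(transitive_deps("compute", deps)))  # ['a_write', 'b_upd']
```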
With

- https://github.com/pocl/pocl/pull/1069
- https://github.com/inducer/pyopencl/pull/452

the following passes for me:

```
LOOPY_NO_CACHE=1 pycl test_target.py 'test_passing_bajillions_of_svm_args(cl._csc)'
```

Let me know if you can reproduce that.
I obviously can't guarantee that that's what's at issue here, but I suspect you'll need https://github.com/pocl/pocl/pull/1069 (or another fix for the same issue) in order to allow this to work....
With `CU_MEM_ATTACH_GLOBAL`, I don't think you have a guarantee that the memory will be accessible from the host. Also, since you seem to attribute the crash in the sample code...
Btw, I agree that this discussion does not have much to do with Loopy. Maybe let's continue the discussion here: https://github.com/inducer/pyopencl/pull/452.
> I found another fix (workaround?) in [pocl/pocl@03ffc71](https://github.com/pocl/pocl/commit/03ffc7146f425bee6e6345dfe4208d095ddd7e7b) which just uses CUDA functions for the memfill operation. With that fix, my simple test and the test in this PR also...
@matthiasdiener Please don't force-push to branches on which more than one person is working. Not only is there a risk of clobbering one another's work, it's also very hard to...
In either case, you'll need a global barrier. After that, you might as well run a (short!) sequential reduction loop, which is going to be faster (and matches best practices...
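As a rough illustration of that shape (plain NumPy standing in for device code; the group size and function names are made up): stage one produces per-workgroup partial sums, the global barrier corresponds to the boundary between the two stages, and stage two is a short sequential loop over the few partials:

```python
import numpy as np

# Sketch of a two-stage reduction. In real OpenCL/CUDA code the
# "global barrier" is the boundary between two kernel launches; here
# each stage is just a NumPy step. Sizes and names are made up.
GROUP_SIZE = 256

def two_stage_sum(x):
    n = len(x)
    pad = (-n) % GROUP_SIZE
    xp = np.concatenate([x, np.zeros(pad, dtype=x.dtype)])
    # Stage 1: each "workgroup" reduces its chunk to one partial sum.
    partials = xp.reshape(-1, GROUP_SIZE).sum(axis=1)
    # (Global barrier here: all partials must be written before stage 2.)
    # Stage 2: short sequential loop over the handful of partials.
    total = 0.0
    for p in partials:
        total += p
    return total

x = np.arange(1000, dtype=np.float64)
assert two_stage_sum(x) == x.sum()
```

The point of the short sequential second stage is that, once only a few partials remain, a single thread looping over them is cheaper than launching another parallel pass.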