test_local_parallel_scan fails on pocl-cuda and intel-cpu
Failure on pocl-cuda with n=16 can be reproduced locally.
On intel-cpu it does not reproduce locally and fails only intermittently on CI. See https://github.com/inducer/loopy/actions/runs/3787208816/jobs/6438795872
Oclgrind, NVIDIA, pocl-pthread all work.
Wonder if this is a test issue or a compiler issue.
Reported it upstream yesterday: https://github.com/pocl/pocl/issues/1157
I don't think it's our bug.
test_local_parallel_scan fails on pocl-cuda 3.0 as well.
Ah, never mind. https://github.com/pocl/pocl/issues/1157 is about the pyopencl scan. https://github.com/inducer/loopy/issues/600 is an ongoing issue that @kaushikcfd promised to fix at some point.
Turns out we've never run GPU CI with pocl-cuda on Loopy. We should probably add that...
Which version of the Intel CL runtime?
2022.15.12.0.01_rel from the intel/llvm repo, and 2023.0.0 from conda.
I'm leaning towards these being distinct issues. I don't trust Intel CL 2022.15.12.0.01_rel on account of https://github.com/intel/llvm/issues/7877. The pocl-cuda failure we'd have to troubleshoot, but given https://github.com/pocl/pocl/issues/1157 (which affects a scan), it might be best to do so on pocl-cuda 3.0.
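For the troubleshooting, a host-side reference might help separate a test issue from a compiler issue. The following is only a rough sketch: it assumes the test boils down to an inclusive prefix sum over n=16 values (an assumption, not taken from the test source), and the helper names `sequential_scan`, `stepwise_parallel_scan`, and `first_mismatch` are hypothetical. It emulates a Hillis-Steele style scan in plain Python, which is not necessarily the schedule Loopy generates, and compares it to a sequential result.

```python
from itertools import accumulate


def sequential_scan(values):
    """Reference inclusive prefix sum, computed one element at a time."""
    return list(accumulate(values))


def stepwise_parallel_scan(values):
    """Hillis-Steele style inclusive scan, emulated on the host.

    Each pass reads only the previous pass's buffer, which mimics the
    barrier-separated steps a work-group would take. This is an
    illustration, not the code Loopy emits.
    """
    buf = list(values)
    stride = 1
    while stride < len(buf):
        prev = buf[:]  # snapshot stands in for a barrier between passes
        for i in range(stride, len(buf)):
            buf[i] = prev[i] + prev[i - stride]
        stride *= 2
    return buf


def first_mismatch(expected, actual):
    """Return the first index where the two sequences differ, or None."""
    for idx, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return idx
    return None


n = 16  # size assumed from the failing case mentioned in this thread
data = [i * i for i in range(n)]
expected = sequential_scan(data)
emulated = stepwise_parallel_scan(data)
print("first mismatch:", first_mismatch(expected, emulated))
```

Feeding the device output into `first_mismatch` in place of `emulated` would show where the result first diverges. A mismatch at a stride boundary might hint at a missing barrier, though that is speculation until checked against the generated kernel.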