loopy test_local_parallel_scan fails on pocl-cuda and intel-cpu

The failure on pocl-cuda with n=16 reproduces locally. With intel-cpu it does not reproduce locally and is intermittent on CI. See https://github.com/inducer/loopy/actions/runs/3787208816/jobs/6438795872

Oclgrind, NVIDIA, and pocl-pthread all work.

I wonder whether this is a test issue or a compiler issue.

Dec 28 '22 02:12 isuruf

Reported it upstream yesterday: https://github.com/pocl/pocl/issues/1157

I don't think it's our bug.

Dec 28 '22 04:12 inducer

test_local_parallel_scan fails on pocl-cuda 3.0 as well.

Dec 28 '22 04:12 isuruf

Ah. Nvm. https://github.com/pocl/pocl/issues/1157 is about the pyopencl scan. https://github.com/inducer/loopy/issues/600 is something that's been going on for a while, and @kaushikcfd promised he would fix it at some point.

Dec 28 '22 04:12 inducer

Turns out we've never run GPU CI with pocl-cuda on Loopy. We should probably add that...

Dec 28 '22 04:12 inducer

Which version of the Intel CL runtime?

Dec 28 '22 04:12 inducer

2022.15.12.0.01_rel from the intel/llvm repo and 2023.0.0 from conda.

Dec 28 '22 04:12 isuruf

I'm leaning towards those being distinct issues. I don't trust Intel CL 2022.15.12.0.01_rel on account of https://github.com/intel/llvm/issues/7877. pocl-cuda we'd have to troubleshoot, but given https://github.com/pocl/pocl/issues/1157 (which also affects a scan), it might be best to do so on pocl-cuda 3.0.

Dec 28 '22 04:12 inducer