[WIP] Fix operations issues on large arrays
## About
This MR fixes a "pycuda hanging forever" issue when array sizes exceed 2**34 bytes.
It does so by replacing some occurrences of `unsigned int` with `size_t` in the templated kernels (element-wise, reduction, scan).

Closes #375
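For context, here is a minimal CPU-only illustration (plain NumPy, no GPU needed, variable names are mine) of why a 32-bit index cannot address arrays with more than `2**32` elements: truncating a 64-bit byte or element offset to 32 bits wraps it modulo `2**32`, so large offsets alias elements near the start of the array.

```python
import numpy as np

# Offsets of the last two elements of an array with more than 2**32 items.
n_items = 2**32 + 5
offsets = np.array([n_items - 2, n_items - 1], dtype=np.uint64)

# Truncating the 64-bit offsets to 32 bits wraps them modulo 2**32,
# so they alias elements near the start of the array instead.
wrapped = offsets.astype(np.uint32)

print(offsets)  # [4294967299 4294967300]
print(wrapped)  # [3 4]
```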
The tests had to be done on arrays of `double` to avoid numerical issues.
### ElementWise

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.elementwise import ElementwiseKernel

eltwise = ElementwiseKernel("double* d_arr", "d_arr[i] = i", "linspace")
d_arr = garray.empty((512, 2048, 2048), np.float64)
eltwise(d_arr)
result = d_arr.get()
reference = np.arange(d_arr.size, dtype=np.float64).reshape(d_arr.shape)
assert np.allclose(result, reference)
```
### Reduction

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.reduction import ReductionKernel

reduction = ReductionKernel(np.float64, neutral="0", reduce_expr="a+b",
                            map_expr="x[i]", arguments="double* x")
d_arr = garray.zeros((512, 2048, 2048), np.float64)
d_arr.fill(1)  # fill() itself runs an elementwise kernel
result = reduction(d_arr.ravel()).get()[()]
assert result == d_arr.size
```
### Scan

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.scan import InclusiveScanKernel

cumsum = InclusiveScanKernel(np.float64, "a+b")
d_arr = garray.zeros((512, 2048, 2048), np.float64)
d_arr.fill(1)
result = cumsum(d_arr.ravel()).get()
assert result[-1] == d_arr.size
```
Thanks! Pushed to Gitlab for CI at https://gitlab.tiker.net/inducer/pycuda/-/commits/large_arrays_ops. (Click through from there to pipeline result).
Thanks for providing tests in the GitHub description. Could you add them to the automated tests? See test/test_gpuarray.py for examples.
Those CI runs mostly failed, by the looks of it. See for example https://gitlab.tiker.net/inducer/pycuda/-/jobs/461930#L2128:

```
____________________ TestGPUArray.test_ranged_elwise_kernel ____________________
Traceback (most recent call last):
  File "/var/lib/gitlab-runner/builds/EsrAZ4Sv/0/inducer/pycuda/test/test_gpuarray.py", line 511, in test_ranged_elwise_kernel
    drv.Context.synchronize()
pycuda._driver.LogicError: cuCtxSynchronize failed: an illegal memory access was encountered
--------------------------- Captured stderr teardown ---------------------------
```
Thanks, I'll look into it.
The last commit seems to have fixed the problem. Tested on:
- Titan V, Debian 11, CUDA 11.2, Python 3.9.2
- Tesla V100, Ubuntu 20.04, CUDA 10.1, Python 3.8.10
Interestingly, the kernel template in `get_elwise_range_module()` needs `long i;` and fails for `unsigned long`, suggesting that the running index has to be a signed integer. I haven't looked into the details; perhaps a `dtype_to_ctype()` call fails somewhere when preparing the kernel.
I'll add the above tests in the pycuda test suite.
Side note: I have a failing test on the first platform with `test_cumath.py::test_atan`, on both the main branch and this one. But the error seems to be only slightly above the threshold:
```
>       assert (max_err <= threshold).all(), (max_err, name, dtype)
E       AssertionError: (2.3841858e-07, 'atan', <class 'numpy.float32'>)
E       assert False
E        +  where False = <built-in method all of numpy.bool_ object at 0x7fa9540bd980>()
E        +    where <built-in method all of numpy.bool_ object at 0x7fa9540bd980> = (2.3841858e-07 <= 2e-07).all

/tmp/paleo/pycuda_upstream/test/test_cumath.py:44: AssertionError
```
Thanks! Pushed to Gitlab for CI.
Pushed again.
You've got this marked "WIP" in the title. Is this ready to go from your end?
Tests are now passing on the GitLab CI. Good! Although the ones from this PR are skipped because none of the cards has enough memory.
Before merging I'd like to rework the tests a little, as I feel they are cumbersome.
Basically, the main constraint is that the tested arrays must have `n_items >= 2**32` (4.3 GB for 1-byte dtypes, 17.2 GB for 4-byte dtypes).
Unfortunately, numerical issues quickly pop up:

- `array[i] = i` (elementwise) does not work for `uint32` (overflow), nor for `float32` (which cannot exactly represent all integers > `2**24`). Fortunately only the index has to be tested, not the value, so we can just do `array[i] = 1` and use a 1-byte datatype.
- Summing `2**32` elements (reduction) ends up in overflow (`uint32`) or loss of precision (`float32`). Here again we only need to test the running index `i` and not the value, so we can do something like `(a + b) mod 256` and keep a `uint8` dtype.
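As a sanity check of the `(a + b) mod 256` idea, here is a CPU-only sketch (plain NumPy, no 17 GB allocation; `mod256_reduce` and the chunking are hypothetical helpers of mine, not the actual pycuda kernel): summing `n_items` ones under a modulo-256 reduction should yield exactly `n_items mod 256`, regardless of array size.

```python
import numpy as np

def mod256_reduce(chunk_sums):
    # Reduce per-chunk partial sums modulo 256, mimicking an
    # "(a+b) % 256" reduce_expr over uint8 data.
    total = np.uint8(0)
    for s in chunk_sums:
        total = np.uint8((int(total) + int(s)) % 256)
    return total

n_items = 2**32 + 7   # pretend array size; nothing is actually allocated
chunk = 2**20         # process in chunks to keep the sketch cheap
chunk_sums = []
remaining = n_items
while remaining > 0:
    k = min(chunk, remaining)
    # each element is 1, so a chunk of k ones sums to k mod 256
    chunk_sums.append(k % 256)
    remaining -= k

assert mod256_reduce(chunk_sums) == n_items % 256
print(mod256_reduce(chunk_sums))  # 7
```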
This should be better. The tests now require "only" 4.5 GB of device memory (they are skipped otherwise) and should be faster.
Pushed to Gitlab.
Thanks! Tests passed on all 3 CI GPUs. This PR should be good to go now.
Pushed to Gitlab.
LGTM, thanks!