[WIP] Fix operations issues on large arrays
## About
This MR fixes a "pycuda hanging forever" issue when array sizes exceed 2**34 bytes.
It does so by replacing some occurrences of `unsigned int` with `size_t` in the templated kernels (element-wise, reduction, scan).

Closes #375
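For context, here is a minimal CPU-only illustration (plain NumPy, no GPU needed, variable names are mine) of why a 32-bit index cannot address arrays with more than `2**32` elements: truncating a 64-bit byte or element offset to 32 bits wraps it modulo `2**32`, so large offsets alias elements near the start of the array.

```python
import numpy as np

# Offsets of the last two elements of an array with more than 2**32 items.
n_items = 2**32 + 5
offsets = np.array([n_items - 2, n_items - 1], dtype=np.uint64)

# Truncating the 64-bit offsets to 32 bits wraps them modulo 2**32,
# so they alias elements near the start of the array instead.
wrapped = offsets.astype(np.uint32)

print(offsets)  # [4294967299 4294967300]
print(wrapped)  # [3 4]
```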
The tests had to be done on arrays of `double` to avoid numerical issues.
### ElementWise

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.elementwise import ElementwiseKernel

eltwise = ElementwiseKernel("double* d_arr", "d_arr[i] = i", "linspace")
d_arr = garray.empty((512, 2048, 2048), np.float64)
eltwise(d_arr)
result = d_arr.get()
reference = np.arange(d_arr.size, dtype=np.float64).reshape(d_arr.shape)
assert np.allclose(result, reference)
```
### Reduction

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.reduction import ReductionKernel

reduction = ReductionKernel(np.float64, neutral="0", reduce_expr="a+b",
                            map_expr="x[i]", arguments="double* x")
d_arr = garray.zeros((512, 2048, 2048), np.float64)
d_arr.fill(1)  # fill() itself runs an elementwise kernel
result = reduction(d_arr.ravel()).get()[()]
assert result == d_arr.size
```
### Scan

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.scan import InclusiveScanKernel

cumsum = InclusiveScanKernel(np.float64, "a+b")
d_arr = garray.zeros((512, 2048, 2048), np.float64)
d_arr.fill(1)
result = cumsum(d_arr.ravel()).get()
assert result[-1] == d_arr.size
```
Thanks! Pushed to Gitlab for CI at https://gitlab.tiker.net/inducer/pycuda/-/commits/large_arrays_ops. (Click through from there to pipeline result).
Thanks for providing tests in the GitHub description. Could you add them to the automated tests? See test/test_gpuarray.py for examples.
Those CI runs mostly failed, by the looks of it. See for example https://gitlab.tiker.net/inducer/pycuda/-/jobs/461930#L2128:

```
____________________ TestGPUArray.test_ranged_elwise_kernel ____________________
Traceback (most recent call last):
  File "/var/lib/gitlab-runner/builds/EsrAZ4Sv/0/inducer/pycuda/test/test_gpuarray.py", line 511, in test_ranged_elwise_kernel
    drv.Context.synchronize()
pycuda._driver.LogicError: cuCtxSynchronize failed: an illegal memory access was encountered
--------------------------- Captured stderr teardown ---------------------------
```
Thanks, I'll look into it.
The last commit seems to have fixed the problem. Tested on:
- Titan V, Debian 11, CUDA 11.2, Python 3.9.2
- Tesla V100, Ubuntu 20.04, CUDA 10.1, Python 3.8.10
Interestingly, the kernel template in `get_elwise_range_module()` needs `long i;` and fails for `unsigned long`, suggesting that the running index has to be a signed integer. I haven't looked into the details; perhaps a `dtype_to_ctype()` call fails somewhere when preparing the kernel.
I'll add the above tests in the pycuda test suite.
Side note: I have a failing test on the first platform with `test_cumath.py::test_atan`, on both the main branch and this one. But the error seems to be only slightly above the threshold:
```
>       assert (max_err <= threshold).all(), (max_err, name, dtype)
E       AssertionError: (2.3841858e-07, 'atan', <class 'numpy.float32'>)
E       assert False
E        +  where False = <built-in method all of numpy.bool_ object at 0x7fa9540bd980>()
E        +    where <built-in method all of numpy.bool_ object at 0x7fa9540bd980> = (2.3841858e-07 <= 2e-07).all

/tmp/paleo/pycuda_upstream/test/test_cumath.py:44: AssertionError
```
Thanks! Pushed to Gitlab for CI.
Pushed again.
You've got this marked "WIP" in the title. Is this ready to go from your end?
Tests are now passing on the GitLab CI. Good! Although the ones from this PR are skipped because none of the cards has enough memory.
Before merging I'd like to rework the tests a little, as I feel they are cumbersome.
Basically, the main constraint is that the tested arrays must have `n_items >= 2**32` (4.3 GB for 1-byte dtypes, 17.2 GB for 4-byte dtypes).
Unfortunately, numerical issues quickly pop up:

- `array[i] = i` (elementwise) does not work for `uint32` (overflow), nor for `float32` (which cannot exactly represent all integers > `2**24`). Fortunately only the index has to be tested, not the value, so we can just do `array[i] = 1` and use a 1-byte datatype.
- Summing `2**32` elements (reduction) ends up in overflow (`uint32`) or loss of precision (`float32`). Here again we only need to test the running index `i` and not the value, so we can do something like `(a + b) mod 256` and keep a `uint8` dtype.
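As a sanity check of the `(a + b) mod 256` idea, here is a CPU-only sketch (plain NumPy, no 17 GB allocation; `mod256_reduce` and the chunking are hypothetical helpers of mine, not the actual pycuda kernel): summing `n_items` ones under a modulo-256 reduction should yield exactly `n_items mod 256`, regardless of array size.

```python
import numpy as np

def mod256_reduce(chunk_sums):
    # Reduce per-chunk partial sums modulo 256, mimicking an
    # "(a+b) % 256" reduce_expr over uint8 data.
    total = np.uint8(0)
    for s in chunk_sums:
        total = np.uint8((int(total) + int(s)) % 256)
    return total

n_items = 2**32 + 7   # pretend array size; nothing is actually allocated
chunk = 2**20         # process in chunks to keep the sketch cheap
chunk_sums = []
remaining = n_items
while remaining > 0:
    k = min(chunk, remaining)
    # each element is 1, so a chunk of k ones sums to k mod 256
    chunk_sums.append(k % 256)
    remaining -= k

assert mod256_reduce(chunk_sums) == n_items % 256
print(mod256_reduce(chunk_sums))  # 7
```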
This should be better. The tests now require "only" 4.5 GB of device memory (they are skipped otherwise) and should be faster.
Pushed to Gitlab.
Thanks! Tests passed on all 3 CI GPUs. This PR should be good to go now.
Pushed to Gitlab.
LGTM, thanks!