
optimization: support 2-D/3-D arrays with strides >= the packed row size, for better memory bandwidth utilization

Open nevion opened this issue 11 years ago • 2 comments

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472173

You can see that, depending on the graphics card model, there are stride values to avoid - funnily enough, often strides that are powers of 2, which are fairly common numbers. If you ignore this, you can get into trouble: you under-utilize the memory channels available in hardware and also increase bank conflicts, so the hit is twofold. The fix is simply to pad the strides slightly to avoid these conflicts (the values to avoid are found in vendor/model documentation). This should be of use to any GPU implementation and likely some accelerators. It should not be done when the device type is CPU, or for code doing vector operations directly on memory where the hardware has alignment restrictions.
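The padding idea can be sketched in a few lines of Python. This is a hypothetical helper, not part of pyopencl; the 256-byte bank stride is an assumed value for illustration - the real values to avoid come from the vendor documentation linked above.

```python
def pad_pitch(width, itemsize, bank_stride=256):
    """Return a row pitch in bytes, padded to dodge bad stride values.

    Hypothetical helper for illustration only; ``bank_stride`` is an
    assumed per-device constant taken from vendor docs, not queried
    from the hardware.
    """
    pitch = width * itemsize
    # Pitches that are a multiple of the bank stride (often a power of 2)
    # map successive rows onto the same memory channel/bank. Bump such a
    # pitch by one element: this keeps element alignment while breaking
    # the conflict pattern.
    if pitch % bank_stride == 0:
        pitch += itemsize
    return pitch
```

The allocation cost is one extra element per row at worst, which is negligible next to the bandwidth lost to channel/bank conflicts.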

As such, I tried to use such an array with pyopencl but it doesn't work out very well in manipulation/printing/inspection:

    def make_cl_image_buffer(queue, img_dims, img_dtype, linestep = None):
        img_dtype = dtype_of(img_dtype)
        if linestep is None:
            linestep = img_dims[1] * img_dtype.itemsize
            if queue.device.type == cl.device_type.GPU:
                #make sure to have an uneven line width for better memory channel utilization, as per AMD recommendation
                if linestep % 2 == 0:
                    linestep += 1
        buf = cl.Buffer(queue.context, cl.mem_flags.READ_WRITE, img_dims[0]*linestep)
        return buf, linestep

    def make_cl_image(queue, img_dims, img_dtype, linestep = None):
        img_dtype = dtype_of(img_dtype)
        img_buf, linestep = make_cl_image_buffer(queue, img_dims, img_dtype, linestep)
        strides = (linestep, dtype_of(img_dtype).itemsize)
        img = clarray.Array(queue, img_dims, img_dtype, strides=strides, data=img_buf)
        return img


In [3]: img=com.make_cl_image(queue, (2044, 2044), np.uint32)

In [12]: img.shape
Out[12]: (2044, 2044)

In [13]: img.strides
Out[13]: (8177, 4)
    /home/jason/tmp/pyopencl.git/pyopencl/array.py in get(self, queue, ary, async)
        685                 raise TypeError("'ary' has non-matching type")
        686
    --> 687         assert self.flags.forc, "Array in get() must be contiguous"
        688
        689         if self.size:

    AssertionError: Array in get() must be contiguous

I searched around and found someone else curious about this last year - he also put some work into this: https://github.com/pyopencl/pyopencl/pull/54

I wouldn't mind adding some lines to the library to make manipulation work with no surprises, but I was unsure of the best way to do it, since it seems everything relies on the element-wise kernels. I think the best/most compatible approach is to have the element-wise kernels take an additional stride parameter, which they use to calculate the byte offset of the element they process.
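The offset arithmetic such a stride-aware kernel would need is small; here is a pure-Python sketch of it (a real kernel would do the same arithmetic in OpenCL C, with the strides passed as extra arguments):

```python
def byte_offset(index, strides):
    """Byte offset of a multi-dimensional element, given per-axis byte
    strides - e.g. index (i, j) with strides (linestep, itemsize).

    Illustrative sketch of the indexing a stride-aware element-wise
    kernel would perform; not pyopencl API.
    """
    return sum(i * s for i, s in zip(index, strides))
```

With the padded array above (strides `(8177, 4)`), element `(i, j)` lives at byte `i*8177 + j*4` instead of the packed `i*8176 + j*4`.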

What do you think? Is it possible for you to add this? If you don't have the time, can you outline the way to get it done?

ps. Tried bringing this up for discussion on the ML, got stuck in moderation.

nevion avatar Jan 10 '15 02:01 nevion

PyOpenCL's arrays are, for now, restricted to being contiguous for most operations.

To my mind, the correct way to implement operations on multi-dimensional arrays would use loopy.

inducer avatar Jan 18 '15 01:01 inducer

loopy is a cool thing to have learned about - I see libraries like that pop up from academic labs in a few different contexts.

But on the kernel side, what we're talking about probably requires no code changes to anything that already takes pitches/strides in bytes - i.e. code that assumes an unpacked layout. If the stride is a bad value for a certain GPU, then one would want to set it >= the packed value, and this can greatly change memory bandwidth utilization and bank conflicts.

We just need a small bit more support to make allocating those arrays not too painful - of course it's best if it works with everything that is currently supported. The most important operations are:

- get matrix values
- set from or to a packed numpy array or another clarray
- set to a constant value
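The get/set cases above reduce to a row-by-row copy between the padded buffer and a packed one. A host-side Python sketch over raw bytes, just to show the shape of the loop (a real implementation would use a kernel or a rectangular copy on the device, not a host loop):

```python
def gather_packed(buf, shape, itemsize, pitch):
    """Copy a padded 2-D byte buffer (pitch >= cols*itemsize) into a
    packed bytes object, one row at a time.

    Host-side illustration only; not how pyopencl would do it.
    """
    rows, cols = shape
    row_bytes = cols * itemsize
    out = bytearray()
    for r in range(rows):
        start = r * pitch          # each row begins at a pitch boundary
        out += buf[start:start + row_bytes]  # drop the padding tail
    return bytes(out)
```

The "set from packed" direction is the same loop with source and destination swapped, and "set to a constant" is a fill that simply skips the padding bytes.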

nevion avatar Jan 18 '15 01:01 nevion