
Multithreading - threads block one another on return.

Open jricker2 opened this issue 9 months ago • 3 comments

Describe the bug When running workloads on multiple GPUs, where each workload runs on a separate thread with its own context/queue/device, we see threads block one another until all are complete. I.e., a workload may finish very quickly on one GPU while another GPU still has lots to do; the thread that completes all its work is stuck until the others complete. No data is shared across threads.

Looking into it a bit further, the issue seems to be related to event cleanup. We have narrowed the blocked threads down to being stuck deleting events from kernel enqueues.

To Reproduce The Python script below assumes platforms[0] is Nvidia and has multiple GPUs. I believe the same can be reproduced with a CPU + GPU pair.

It launches two threads with a disparate amount of work between the two. During execution, t2 reaches the point of returning but is not joined until t1 completes.

import threading
import pyopencl
import numpy

def gpu_workload(gpu_idx, num_indexes):
    platforms = pyopencl.get_platforms()
    devices = platforms[0].get_devices()
    device = devices[gpu_idx]
    context = pyopencl.Context([device])

    data = numpy.random.rand(num_indexes, num_indexes).astype(numpy.float32)
    dest = numpy.ones((num_indexes * num_indexes), dtype=numpy.float32)

    # create buffers
    with pyopencl.CommandQueue(context) as queue:
        data_buf = pyopencl.Buffer(
            queue.context,
            pyopencl.mem_flags.READ_ONLY | pyopencl.mem_flags.COPY_HOST_PTR,
            hostbuf=data,
        )

        dest_buf = pyopencl.Buffer(
            queue.context, pyopencl.mem_flags.WRITE_ONLY, dest.nbytes
        )

        # build the program with optimizations disabled (so the kernel runs long enough)
        prg = pyopencl.Program(
            queue.context,
            """
        __kernel void foo(ushort n, __global float *a, __global float *c)
        {
          int gid = get_global_id(0);
          c[gid] = 0.0f;
          int rowC = gid/n;
          int colC = gid%n;
          __global float *pA = &a[rowC*n];
          __global float *pB = &a[colC];
          for(int k=0; k<n; k++)
          {
             pB = &a[colC+k*n];
             for(int j=0; j<1; j++)
             {
             c[gid] += (*(pA++))*(*pB);
             }
          }

          c[gid] -= c[gid];
        }
        """,
        ).build(options=["-cl-opt-disable"])
        kernel = prg.foo


        print(f"gpu {gpu_idx} work beginning")
        kernel_event = kernel(
            queue,
            dest.shape,
            None,
            numpy.uint16(num_indexes),
            data_buf,
            dest_buf,
        )
        # create output of all ones -> program should set it all to zeros
        output = numpy.ones_like(dest)
        copy_event = pyopencl.enqueue_copy(queue, output, dest_buf, is_blocking=True)
        print(f"gpu {gpu_idx} work done.")

    print(f"gpu {gpu_idx} returning")
    return


def main():
    t1 = threading.Thread(target=gpu_workload, args=(0, 4096 * 2,))
    t2 = threading.Thread(target=gpu_workload, args=(1, 4096,))
    t1.start()
    t2.start()
    t2.join()
    print("shorter gpu join done")
    t1.join()
    print("short gpu join done")

if __name__ == "__main__":
    main()

Expected behavior Threads using separate contexts/queues should not block one another, as far as I know.

Environment (please complete the following information):

  • OS: Linux SLES 15 SP6
  • ICD Loader and version: not sure
  • ICD and version: nvidia
  • CPU/GPU: Nvidia RTX A5000, 550/570 drivers tested
  • Python version: 3.9/3.10/3.11
  • PyOpenCL version: 2025.1

Additional context In our full application we run something similar to this in asyncio.to_thread as part of a larger RESTful API application. When the thread blocking occurs, it seems to hold the GIL: we cannot interact with any part of our application, and CPU usage sits at 100% during this time. From nsys it appears to be constantly calling clReleaseEvent.
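For context, the threading in our application is driven roughly like this (a minimal sketch; gpu_workload_stub is a hypothetical stand-in for the real pyopencl workload from the script above):

```python
import asyncio
import time

def gpu_workload_stub(gpu_idx: int, num_indexes: int) -> str:
    # stand-in for the pyopencl workload from the reproduction script
    time.sleep(0.05)
    return f"gpu {gpu_idx} done"

async def main():
    # each workload runs in its own worker thread, as in our RESTful app
    results = await asyncio.gather(
        asyncio.to_thread(gpu_workload_stub, 0, 8192),
        asyncio.to_thread(gpu_workload_stub, 1, 4096),
    )
    print(results)

asyncio.run(main())
```

With the real workload substituted in, one stuck thread stalls the whole event loop for us.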

We are fairly new to PyOpenCL, so it could be that we are missing some property or documentation on thread-safety/blocking across contexts. Thank you!

jricker2 avatar Apr 10 '25 22:04 jricker2

Thanks for the report! I can reproduce what you describe, and I can offer a workaround: add copy_event.wait() to the end of the routine. This helps because a user-initiated wait releases the GIL, while the implicit wait in the event's destructor does not, preventing other threads from making progress.
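Concretely, the change is at the end of gpu_workload in the script above (a sketch against that script, not a standalone example, since it needs an OpenCL device):

```python
        copy_event = pyopencl.enqueue_copy(queue, output, dest_buf, is_blocking=True)
        copy_event.wait()  # explicit, user-initiated wait releases the GIL while waiting
```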

I'm not sure the issue itself can be meaningfully fixed, as doing so would require releasing the GIL in the middle of a destructor. This was pyopencl's original behavior, but it led to hard-to-debug crashes; see:

  • https://gitlab.tiker.net/inducer/pyopencl/-/merge_requests/92/diffs
  • https://github.com/inducer/pyopencl/issues/296

To be fair, we've since moved on to nanobind (from pybind11), but I'm still not sure I'm brave enough to open that can of worms again. 🤔

inducer avatar Apr 10 '25 23:04 inducer

Thanks for the quick response! I added an explicit wait() to the copy event. I will have to do further testing with our full application to see whether the GIL is released at that point and we can interact with the full app, but it seems the thread join is still blocked. I added timestamps to see:

[screenshot: timestamped console output from the two GPU threads]

We see the shorter GPU workload (gpu 1 here) finish at 37:56, but it can't be joined until gpu 0 finishes at 38:17. Is this expected?

jricker2 avatar Apr 11 '25 01:04 jricker2

Just to follow up: it seems our application is still blocked while a thread is stuck returning. I just want to know whether it is expected behavior for threads to block when waiting on events created from separate command queues. The issue only arises for us during the time a thread is stuck (i.e. 20:37:56-20:38:17 in the example above).

jricker2 avatar Apr 17 '25 19:04 jricker2