Multithreading - threads block one another on return.
Describe the bug When running workloads on multiple GPUs, with each workload on a separate thread that has its own context/queue/device, we see threads block until all of them are complete, i.e., one GPU's workload may finish very quickly while another still has lots to do, and the thread that has completed all of its work is stuck until the others finish. There is no data shared across threads.
Looking into it a bit further, the issue seems to be related to event cleanup: we have narrowed the blocked threads down to being stuck deleting events created by kernel enqueues.
To Reproduce The Python script below assumes platforms[0] is Nvidia and has multiple GPUs; I believe the same can be reproduced with a CPU + GPU pair.
It launches two threads with a disparate amount of work between them. During execution, t2 reaches the point of returning, but it is not joined until t1 completes.
```python
import threading

import numpy
import pyopencl


def gpu_workload(gpu_idx, num_indexes):
    platforms = pyopencl.get_platforms()
    devices = platforms[0].get_devices()
    device = devices[gpu_idx]
    context = pyopencl.Context([device])
    queue = pyopencl.CommandQueue(context)

    data = numpy.random.rand(num_indexes, num_indexes).astype(numpy.float32)
    dest = numpy.ones((num_indexes * num_indexes), dtype=numpy.float32)

    # create buffers
    with pyopencl.CommandQueue(context) as queue:
        data_buf = pyopencl.Buffer(
            queue.context,
            pyopencl.mem_flags.READ_ONLY | pyopencl.mem_flags.COPY_HOST_PTR,
            hostbuf=data,
        )
        dest_buf = pyopencl.Buffer(
            queue.context, pyopencl.mem_flags.WRITE_ONLY, dest.nbytes
        )

        # create kernel with disabled optimizations
        prg = pyopencl.Program(
            queue.context,
            """
            __kernel void foo(ushort n, __global float *a, __global float *c)
            {
                int gid = get_global_id(0);
                c[gid] = 0.0f;
                int rowC = gid/n;
                int colC = gid%n;
                __global float *pA = &a[rowC*n];
                __global float *pB = &a[colC];
                for(int k=0; k<n; k++)
                {
                    pB = &a[colC+k*n];
                    for(int j=0; j<1; j++)
                    {
                        c[gid] += (*(pA++))*(*pB);
                    }
                }
                c[gid] -= c[gid];
            }
            """,
        ).build(options=["-cl-opt-disable"])
        kernel = prg.foo

        print(f"gpu {gpu_idx} work beginning")
        kernel_event = kernel(
            queue,
            dest.shape,
            None,
            numpy.uint16(num_indexes),
            data_buf,
            dest_buf,
        )

        # create output of all ones -> program should set it all to zeros
        output = numpy.ones_like(dest)
        copy_event = pyopencl.enqueue_copy(queue, output, dest_buf, is_blocking=True)
        print(f"gpu {gpu_idx} work done.")
        print(f"gpu {gpu_idx} returning")
        return


def main():
    t1 = threading.Thread(target=gpu_workload, args=(0, 4096 * 2,))
    t2 = threading.Thread(target=gpu_workload, args=(1, 4096,))
    t1.start()
    t2.start()
    t2.join()
    print("shorter gpu join done")
    t1.join()
    print("short gpu join done")


if __name__ == "__main__":
    main()
```
Expected behavior Threads using separate contexts/queues should not block one another, AFAIK.
Environment:
- OS: Linux SLES 15 SP6
- ICD Loader and version: not sure
- ICD and version: nvidia
- CPU/GPU: Nvidia RTX A5000, 550/570 drivers tested
- Python version: 3.9/3.10/3.11
- PyOpenCL version: 2025.1
Additional context In our full application we run something similar to this in an asyncio.to_thread as part of a larger RESTful API application. When the thread blocking occurs, it appears to be holding the GIL, since we cannot interact with any part of our application; it also uses 100% of a CPU during this time. From nsys it appears to be constantly calling clReleaseEvent.
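For context, a rough, simplified sketch of how we dispatch the workload (the names handle_gpu_request/serve are illustrative, not from our real app, and gpu_workload is the function from the reproducer above):

```python
# Illustrative sketch only; the real application is a larger RESTful service.
# The point is that the workload runs in a worker thread via asyncio.to_thread,
# so a thread that blocks without releasing the GIL stalls the whole event loop.
import asyncio


async def handle_gpu_request(gpu_idx: int, num_indexes: int) -> None:
    # gpu_workload is the function from the reproducer above
    await asyncio.to_thread(gpu_workload, gpu_idx, num_indexes)


async def serve() -> None:
    # two concurrent "requests", analogous to t1/t2 in the reproducer
    await asyncio.gather(
        handle_gpu_request(0, 4096 * 2),
        handle_gpu_request(1, 4096),
    )


if __name__ == "__main__":
    asyncio.run(serve())
```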
We are fairly new to using PyOpenCL, so it could be that we are missing some property/documentation on thread-safety/blocking across contexts. Thank you!
Thanks for the report! I can reproduce what you describe, and I can offer a workaround: add `copy_event.wait()` at the end of the routine. This helps because a user-initiated wait releases the GIL, whereas the implicit wait in the destructor does not, preventing progress.
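Concretely, against the reproducer above, the tail of gpu_workload would look roughly like this (sketch; only the last few lines shown):

```python
# Workaround sketch: tail of gpu_workload from the reproducer above.
output = numpy.ones_like(dest)
copy_event = pyopencl.enqueue_copy(queue, output, dest_buf, is_blocking=True)

# A user-initiated wait() releases the GIL while it blocks; the implicit wait
# in the event's destructor does not, so wait explicitly before returning.
copy_event.wait()

print(f"gpu {gpu_idx} work done.")
```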
I'm not sure the issue itself can be meaningfully fixed, as doing so would require that the GIL be released in the middle of a destructor. This was pyopencl's original behavior, but it led to hard-to-debug crashes; see:
- https://gitlab.tiker.net/inducer/pyopencl/-/merge_requests/92/diffs
- https://github.com/inducer/pyopencl/issues/296
To be fair, we've moved on to nanobind (from pybind), but still I'm not sure I'm brave enough to open that can of worms again. 🤔
Thanks for the quick response! I added an explicit wait() on the copy event. I will have to do further testing with our full application to see whether the GIL is released at that point and we can interact with the full app, but it seems the thread join is still blocked. I added timestamps to check:
We see the shorter GPU workload (gpu 1 here) finish at 37:56, but it can't be joined until gpu 0 finishes at 38:17. Is this expected?
Just to follow up: it seems our application is still blocked while a thread is stuck returning. We just want to know whether it is expected behavior for threads to block when waiting on events created from separate command queues. The issue only arises for us during the time a thread is stuck (i.e., 20:37:56-20:38:17 in the example above).