Waiting for Events in the Command Buffers Extension

Open FreddieWitherden opened this issue 3 years ago • 5 comments

I've been going through the proposed command buffer extension, to see if our application, PyFR, might benefit from it.

However, while attempting to port the OpenCL backend of our application over, I encountered an issue around events: it does not appear to be possible to wait on any of the kernels inside a command buffer. This functionality is extremely useful in HPC, where it is necessary to interact with external libraries such as MPI. The idea is to have kernels which pack buffers and copy them to the host and then, as they complete, have the host kick off the relevant MPI_Isend calls. This is easy with regular OpenCL using events, and with CUDA/HIP's graph APIs using their event record nodes.
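
For reference, a minimal sketch of that plain-OpenCL pattern (the kernel, struct, and rank names are placeholders, and calling MPI_Isend from the callback thread assumes MPI was initialised with MPI_THREAD_MULTIPLE):

#include <CL/cl.h>
#include <mpi.h>

typedef struct {
    void       *host_buf;   /* pinned staging buffer */
    size_t      nbytes;
    int         dest;       /* neighbouring MPI rank */
    MPI_Request req;
} pack_ctx;

/* Fires on an implementation thread once the read-back completes. */
static void CL_CALLBACK on_pack_done(cl_event ev, cl_int status, void *user_data)
{
    pack_ctx *ctx = user_data;

    if (status == CL_COMPLETE)
        MPI_Isend(ctx->host_buf, (int) ctx->nbytes, MPI_BYTE, ctx->dest, 0,
                  MPI_COMM_WORLD, &ctx->req);
}

void enqueue_pack_and_send(cl_command_queue q, cl_kernel pack_kernel,
                           cl_mem staging, size_t gsz, pack_ctx *ctx)
{
    cl_event pack_ev, read_ev;

    /* Pack the halo into the staging buffer on the device... */
    clEnqueueNDRangeKernel(q, pack_kernel, 1, NULL, &gsz, NULL, 0, NULL, &pack_ev);

    /* ...copy it back to the host... */
    clEnqueueReadBuffer(q, staging, CL_FALSE, 0, ctx->nbytes, ctx->host_buf,
                        1, &pack_ev, &read_ev);

    /* ...and hand the buffer to MPI as soon as the copy has finished. */
    clSetEventCallback(read_ev, CL_COMPLETE, on_pack_done, ctx);

    clReleaseEvent(pack_ev);
    clReleaseEvent(read_ev);
}

It is this last step, waiting on an individual command, which has no equivalent inside a command buffer.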

One solution here might be to add a

cl_int clCreateCommandSyncEventKHR(
   cl_command_buffer_khr command_buffer,
   cl_sync_point_khr* sync_point,
   cl_uint *ix
);

method to convert a sync point to an index and then a:

cl_int clEnqueueCommandBufferWithSyncEventsKHR(
     cl_uint num_queues,
     cl_command_queue* queues,
     cl_command_buffer_khr command_buffer,
     cl_uint num_sync_events,
     cl_event* sync_event_list,
     cl_uint num_events_in_wait_list,
     const cl_event* event_wait_list,
     cl_event* event);

which will mint a set of cl_events with the number of events being equal to the number of times clCreateCommandSyncEventKHR was called during recording. On platforms without any kind of native or preferred support for host side waiting, the functionality can be emulated internally without too much hassle (by cracking apart the command buffer into multiple smaller command buffers in an implementation defined manner).
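
To make the proposal concrete, recording and enqueueing might look roughly as follows. The two *SyncEvent* entry points are the hypothetical ones above, not shipping API, and queue, pack_kernel, gsz, on_pack_done and ctx are placeholders:

cl_int err;
cl_command_buffer_khr cmdbuf = clCreateCommandBufferKHR(1, &queue, NULL, &err);

/* Record the pack kernel and capture its sync point. */
cl_sync_point_khr pack_sp;
clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, pack_kernel, 1, NULL, &gsz, NULL,
                          0, NULL, &pack_sp, NULL);

/* Register the sync point as one to surface as a cl_event; ix gives its
   position in the event array returned at enqueue time. */
cl_uint pack_ix;
clCreateCommandSyncEventKHR(cmdbuf, &pack_sp, &pack_ix);

clFinalizeCommandBufferKHR(cmdbuf);

/* One cl_event per registered sync point is minted on each enqueue. */
cl_event sync_events[1];
clEnqueueCommandBufferWithSyncEventsKHR(1, &queue, cmdbuf, 1, sync_events,
                                        0, NULL, NULL);
clSetEventCallback(sync_events[pack_ix], CL_COMPLETE, on_pack_done, &ctx);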

FreddieWitherden avatar May 09 '22 18:05 FreddieWitherden

Thanks very much for taking the time to provide feedback on the extension. From thinking this over, am I right in understanding that it's only the device -> host one-way synchronization that you're looking for here, rather than the two-way synchronization that regular cl_events can provide (the host -> device sync via clSetUserEventStatus())? I think converting sync-points to cl_events will work for the one-way case but not the two-way case, unless I'm misunderstanding the proposal. Not that limiting things to one-way would be a problem, but I want to understand the scope of extra functionality that would be useful.
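
For reference, the two-way pattern in core OpenCL looks roughly like this, with a user event gating device work (context, queue, kernel and gsz are assumed to already exist):

cl_int err;
cl_event gate = clCreateUserEvent(context, &err);

/* The kernel will not be submitted for execution until the gate completes. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsz, NULL, 1, &gate, NULL);
clFlush(queue);

/* ...later, once e.g. an MPI_Irecv has landed on the host... */
clSetUserEventStatus(gate, CL_COMPLETE);
clReleaseEvent(gate);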

An alternative solution for invoking MPI code would be to have an optional command-buffer command for doing a callback to user code, e.g.:

  clCommandUserCallbackKHR(
      cl_command_buffer_khr command_buffer,
      cl_command_queue command_queue,
      cl_command_user_callback_t user_function,  // Call host side code from here
      void* user_data,
      cl_uint num_sync_points_in_wait_list,
      const cl_sync_point_khr* sync_point_wait_list,
      cl_sync_point_khr* sync_point);

But that wouldn't be as nice if a user wanted the cl_events converted from sync-points to trigger submission of new commands, so it is a less invasive but also less powerful solution.
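
A sketch of how that callback command might be used during recording, assuming cl_command_user_callback_t is a plain void (CL_CALLBACK *)(void *) and reusing the hypothetical pack_ctx struct from the sketch in the opening comment (cmdbuf and copy_sp are likewise placeholders):

static void CL_CALLBACK send_halo(void *user_data)
{
    pack_ctx *ctx = user_data;

    MPI_Isend(ctx->host_buf, (int) ctx->nbytes, MPI_BYTE, ctx->dest, 0,
              MPI_COMM_WORLD, &ctx->req);
}

/* During recording, after the pack and copy commands: */
clCommandUserCallbackKHR(cmdbuf, NULL, send_halo, &ctx,
                         1, &copy_sp, NULL);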

EwanC avatar May 10 '22 14:05 EwanC

Yes, my particular need is only for device -> host synchronization. However, the other direction can also be supported easily enough via:

cl_sync_point_khr *clCreateCommandUserSyncPointKHR(
   cl_command_buffer_khr command_buffer,
   cl_uint *ix, cl_int *err
);

or something to that effect, which creates a new sync point that can be used within the command buffer. When clEnqueueCommandBufferWithSyncEventsKHR is called, some of the returned events will be user events.
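
A sketch of how the two pieces might fit together for the host -> device direction; every *KHR entry point here beyond clCommandNDRangeKernelKHR and clFinalizeCommandBufferKHR is one of the hypothetical ones proposed in this issue, and cmdbuf, queue, unpack_kernel, gsz, num_sync_events and sync_events are placeholders:

cl_int err;
cl_uint gate_ix;

/* Record a user sync point; its index identifies the user event minted later. */
cl_sync_point_khr *gate_sp = clCreateCommandUserSyncPointKHR(cmdbuf, &gate_ix, &err);

/* The unpack kernel may not run until the host signals the gate. */
clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, unpack_kernel, 1, NULL, &gsz, NULL,
                          1, gate_sp, NULL, NULL);

clFinalizeCommandBufferKHR(cmdbuf);
clEnqueueCommandBufferWithSyncEventsKHR(1, &queue, cmdbuf, num_sync_events,
                                        sync_events, 0, NULL, NULL);

/* Host side, once the incoming data is ready: */
clSetUserEventStatus(sync_events[gate_ix], CL_COMPLETE);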

FreddieWitherden avatar May 10 '22 16:05 FreddieWitherden

This is a really great question!

One of the nice properties (at least, from an implementor's perspective) of the current command buffer extension is that a command buffer is entirely "self-contained": the commands within a command buffer can wait on other commands in the same command buffer, but they cannot wait on events from outside of the command buffer, and commands outside of a command buffer cannot wait on a command buffer sync point. This gives an implementation a great deal of optimization freedom and avoids tricky issues like event lifetimes, though it is admittedly a little restrictive.

A few other solutions we could consider, in addition to the ones already described in this issue:

  • Could we allow signaling an OpenCL semaphore from a command buffer? This would most likely be useful for device-to-device synchronization while a command buffer is executing, but it could also be coupled with a wait on semaphore and an event callback (in a different queue?) to get the device-to-host behavior.

    • Note, this would probably only work for a binary semaphore, or we would need a way to update the payload when signaling a timeline semaphore.
    • Additional note: we could also allow waiting on a semaphore from a command buffer, though this is a little trickier and could lead to deadlock, so we should treat signaling and waiting as two separate features.
  • Could implementations that support SVM or USM write to a host-accessible memory address to trivially provide device-to-host synchronization? Would doing so require any memory model considerations to guarantee that if the device-to-host signal is observed then any previous buffer writes are observable also?
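
As a rough illustration of the second idea, a minimal sketch assuming a device that reports fine-grain buffer SVM with atomics (CL_DEVICE_SVM_FINE_GRAIN_BUFFER and CL_DEVICE_SVM_ATOMICS) and a program built with -cl-std=CL2.0 or newer; the release store on the device paired with the acquire load on the host is what makes earlier payload writes observable:

#include <stdatomic.h>
#include <CL/cl.h>

/* Device side: a tiny signalling kernel enqueued after the pack kernel on an
   in-order queue, so all earlier commands have completed before it runs. */
const char *signal_src =
    "kernel void signal_host(global atomic_int *flag)           \n"
    "{                                                          \n"
    "    atomic_store_explicit(flag, 1, memory_order_release,   \n"
    "                          memory_scope_all_svm_devices);   \n"
    "}                                                          \n";

/* Host side (context, queue, pack_kernel, signal_kernel and gsz assumed): */
atomic_int *flag = clSVMAlloc(context,
                              CL_MEM_READ_WRITE |
                              CL_MEM_SVM_FINE_GRAIN_BUFFER |
                              CL_MEM_SVM_ATOMICS,
                              sizeof(atomic_int), 0);
atomic_store(flag, 0);

clSetKernelArgSVMPointer(signal_kernel, 0, flag);

size_t one = 1;
clEnqueueNDRangeKernel(queue, pack_kernel, 1, NULL, &gsz, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, signal_kernel, 1, NULL, &one, NULL, 0, NULL, NULL);
clFlush(queue);

/* Spin until the device's store becomes visible, then kick off the MPI send. */
while (atomic_load_explicit(flag, memory_order_acquire) != 1)
    ;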

bashbaug avatar May 19 '22 23:05 bashbaug

So I don't think extending the proposal to support event waiting would be the end of the world for vendors. If the functionality isn't used there is no burden; if it is used, in the worst case they can just emulate the command buffer (without any hardware acceleration), or with some effort map a single OpenCL command buffer to a sequence of hardware command buffers and chain them together. In all cases, I think a performance win can be expected in overhead-heavy situations.

Semaphores seem like a good solution, although it looks as if one cannot wait on a semaphore directly? Not sure if this is an oversight, or if there is a future extension planned. Having persistent event-type objects which can be waited on directly would greatly simplify code.

I think having the ability to enqueue a 32- or 64-bit value write to an arbitrary address would also be extremely useful (and functionality that is already in HIP and CUDA). If such an extension were available, it would be good if it supported command buffers too.
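
For comparison, the CUDA driver API analogue referred to here looks roughly like the following; stream is assumed to already exist, setup and error checking are omitted, and cuStreamWriteValue32 may need stream memory operations support on some driver versions:

#include <cuda.h>

/* Pinned, device-mapped host allocation the host can poll directly. */
volatile unsigned int *flag_h;
CUdeviceptr flag_d;
cuMemHostAlloc((void **) &flag_h, sizeof(*flag_h), CU_MEMHOSTALLOC_DEVICEMAP);
cuMemHostGetDevicePointer(&flag_d, (void *) flag_h, 0);
*flag_h = 0;

/* ...launch the packing kernels on `stream`... */

/* Writes 1 to the flag once all earlier work in the stream has completed. */
cuStreamWriteValue32(stream, flag_d, 1, CU_STREAM_WRITE_VALUE_DEFAULT);

/* Host side: issue MPI_Isend once the flag flips. */
while (*flag_h != 1)
    ;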

FreddieWitherden avatar May 20 '22 21:05 FreddieWitherden