
[SYCL][DOC] Graph API extensions

Open reble opened this issue 3 years ago • 3 comments

This extension adds an API for separating the definition and execution of a SYCL kernel graph. The patch contains an early-stage prototype that enables replay of a single Level Zero command list.

reble avatar Feb 22 '22 03:02 reble

Interesting concepts! Perhaps the naming could be tweaked to be more SYCL-friendly? Does the _node suffix really matter? It looks like what in SYCL we call a command group, though it could also be seen as a task or similar. add_ looks like submit, so why not keep the same wording? I guess you want to differentiate submit_device from submit_host, so adding _device and _host makes sense. But the host is also a device... add_empty_node could just be an overload of submit_device().

keryell avatar Mar 05 '22 03:03 keryell

But the host is also a device...

...not anymore. ;)

jbrodman avatar Mar 17 '22 19:03 jbrodman

But the host is also a device...

...not anymore. ;)

You do not have an extension for this? :-) It can be an optional feature in all the good implementations when a good CPU is available. ;-)

keryell avatar Mar 17 '22 21:03 keryell

That looks better! Thanks.

keryell avatar Sep 30 '22 07:09 keryell

Tagging @gmlueck for feedback.

reble avatar Dec 13 '22 05:12 reble

Our final draft for revision 1 of the proposed Command Graph extension is ready for review. Tagging @intel/dpcpp-specification-reviewers

reble avatar Mar 24 '23 14:03 reble

This is our final draft for revision 1 of the proposed extension. All feedback has been addressed. Tagging @intel/dpcpp-specification-reviewers

reble avatar Apr 18 '23 13:04 reble

Have you thought about a device-agnostic graph, built using the explicit API, that can be submitted via queues to particular devices? I can see a use case where you build a graph and then send it to multiple devices, either independently or to different devices at the same time. We need to think about how to define memory inputs/outputs and dependencies to the graph in that use case.

Don’t forget the feedback from IWOCL from @illuhad too!

tomdeakin avatar Apr 20 '23 13:04 tomdeakin

Have you thought about a device-agnostic graph, built using the explicit API, that can be submitted via queues to particular devices? I can see a use case where you build a graph and then send it to multiple devices, either independently or to different devices at the same time. We need to think about how to define memory inputs/outputs and dependencies to the graph in that use case.

Thanks for your feedback.

Yes, we considered a command graph that can be executed on multiple devices. In summary, we had to drop this approach because of runtime limitations: you'd need an intermediate representation that is entirely backend-independent. The main issue is data used within a CGF that is captured by reference or could go out of scope. Another issue is device-specific parameters like work-group sizes. It's an open question which restrictions we would have to apply to make this work. That's why we haven't included this feature in the initial revision of the proposal, but we will reconsider this support for a future version of the extension.

reble avatar Apr 26 '23 13:04 reble

@EwanC asked me to elaborate my feedback from IWOCL, so here goes:

Fundamentally, I am not sure whether we need such a large new API surface, and I wonder whether this could be implemented in a simpler, more SYCL-idiomatic way by leaving behind the notion that a "queue" is a queue as we know it e.g. from OpenCL, which I think is misleading. A queue already is about constructing a task graph, and indeed hipSYCL thinks about queues merely as an interface to a task graph, such that a more appropriate name for that object might actually be "task group" or similar.

So I wonder whether we could achieve something similar by

  1. providing user control over the flush behavior of a queue. That would not only allow for the opposite extreme of an explicit eager submission behavior (which is something my users have repeatedly requested in the past), but the other extreme of only flushing in the queue destructor could be mapped to submitting a graph. (Remember that a queue can be thought of as a task group!)
  2. In that case, there would need to be a mechanism to return an object from the queue that can also be replayed.

Such an implementation would also be less at risk of feature divergence between the queue and the task graph API.

illuhad avatar Apr 26 '23 14:04 illuhad

@EwanC asked me to elaborate my feedback from IWOCL, so here goes:

Fundamentally, I am not sure whether we need such a large new API surface, and I wonder whether this could be implemented in a simpler, more SYCL-idiomatic way by leaving behind the notion that a "queue" is a queue as we know it e.g. from OpenCL, which I think is misleading. A queue already is about constructing a task graph, and indeed hipSYCL thinks about queues merely as an interface to a task graph, such that a more appropriate name for that object might actually be "task group" or similar.

So I wonder whether we could achieve something similar by

  1. providing user control over the flush behavior of a queue. That would not only allow for the opposite extreme of an explicit eager submission behavior (which is something my users have repeatedly requested in the past), but the other extreme of only flushing in the queue destructor could be mapped to submitting a graph. (Remember that a queue can be thought of as a task group!)
  2. In that case, there would need to be a mechanism to return an object from the queue that can also be replayed.

Such an implementation would also be less at risk of feature divergence between the queue and the task graph API.

  1. A property like "deferred_execution" supplied to the queue constructor could instruct SYCL to defer execution until an explicit call to `queue.wait()` (or to `event.wait`, I guess) -- the `queue.flush()` already exists and is called `queue.wait()`. A call to a wait method for a "deferred_execution" queue means "launch everything in the task graph leading up to this item, inclusive, then wait for this item".

  2. A property like "replayable" supplied to the queue constructor could instruct SYCL to keep the task graph around until the destructor is called -- no new "object that can be replayed" is needed; replay could be expressed with a new `replay` method on the queue. Would `queue.replay()` implicitly contain a `queue.wait()` or does it require the user to call `wait` again? For a "deferred_execution" queue, the wait would be `queue.wait()`, otherwise it could be `event.wait()` using the event returned by `queue.replay`.

    More important question: is "replay" valid? Captured variables might have been changed by the first/previous invocation and there should be concerns about scoping for byref captures.

Wee-Free-Scot avatar Apr 26 '23 15:04 Wee-Free-Scot

@Wee-Free-Scot Interesting approach too.

   More important question: is "replay" valid? Captured variables might have been changed by the first/previous invocation and there should be concerns about scoping for byref captures.

I guess the original proposal here has also the same problem.

keryell avatar Apr 27 '23 07:04 keryell

@EwanC asked me to elaborate my feedback from IWOCL, so here goes: Fundamentally, I am not sure whether we need such a large new API surface, and I wonder whether this could be implemented in a simpler, more SYCL-idiomatic way by leaving behind the notion that a "queue" is a queue as we know it e.g. from OpenCL, which I think is misleading. A queue already is about constructing a task graph, and indeed hipSYCL thinks about queues merely as an interface to a task graph, such that a more appropriate name for that object might actually be "task group" or similar. So I wonder whether we could achieve something similar by

  1. providing user control over the flush behavior of a queue. That would not only allow for the opposite extreme of an explicit eager submission behavior (which is something my users have repeatedly requested in the past), but the other extreme of only flushing in the queue destructor could be mapped to submitting a graph. (Remember that a queue can be thought of as a task group!)
  2. In that case, there would need to be a mechanism to return an object from the queue that can also be replayed.

Such an implementation would also be less at risk of feature divergence between the queue and the task graph API.

1. A property like "deferred_execution" supplied to the queue constructor could instruct SYCL to defer execution until an explicit call to `queue.wait()` (or to `event.wait`, I guess) -- the `queue.flush()` already exists and is called `queue.wait()`. A call to a wait method for a "deferred_execution" queue means "launch everything in the task graph leading up to this item, inclusive, then wait for this item".

2. A property like "replayable" supplied to the queue constructor could instruct SYCL to keep the task graph around until the destructor is called -- no new "object that can be replayed" is needed, replay could be expressed with a new 'replay' method on the queue. Would `queue.replay()` implicitly contain a `queue.wait()` or does it require the user to call `wait` again? For a "deferred_execution" queue, the wait would be `queue.wait()`, otherwise it could be `event.wait()` using the event returned by `queue.replay`.
   More important question: is "replay" valid? Captured variables might have been changed by the first/previous invocation and there should be concerns about scoping for byref captures.

Thanks for the follow-up comments :)

Although I agree a queue lets you define a task-graph, it's also how that graph is submitted to the device. The main goal we're trying to achieve is splitting these two concerns of command creation & execution, which I feel is cleanest by providing an object that isn't related to execution, rather than trying to overload existing queue execution behaviours.

If queue::flush() or queue::wait() are used with a deferred queue execution model, then I see two drawbacks compared to the current approach: 1) the graph will only ever be submitted once; 2) the operation will perform both the finalization work of creating backend objects and doing optimization, as well as the execution, so these concerns are still linked. Therefore, I think we'd still need a method on the queue equivalent to finalize() that can be passed properties for things like kernel fusion, and the returned handle would look like command_graph.

I'm also not sure if tying a graph to a single queue is a limitation. For a multi-device graph to be based on a single queue, we'd also need an extension like HIPSYCL_EXT_MULTI_DEVICE_QUEUE to make an individual queue multi-device, although this multi-device feature is still to be designed regardless. If we had an API like queue.replay(), it would imply the queue only ever had one graph, however.

I guess the original proposal here has also the same problem.

I think so, command-groups would be captured as nodes in the same way in both approaches.

EwanC avatar Apr 27 '23 20:04 EwanC

@Wee-Free-Scot Interesting approach too.

   More important question: is "replay" valid? Captured variables might have been changed by the first/previous invocation and there should be concerns about scoping for byref captures.

I guess the original proposal here has also the same problem.

Correct.

Anything captured by reference, and temporaries that could go out of scope, require a CGF to execute eagerly. This includes defining a device kernel and setting its arguments, but not launching it yet. Referring to my previous comment, that's why we need to set a device and context at command graph creation. Implementation details: https://github.com/reble/llvm/pull/87 Some restrictions are also required on buffer write-back: https://github.com/reble/llvm/blob/sycl-graph-update/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc#711-buffer-limitations-for-record--replay-api

reble avatar Apr 27 '23 21:04 reble

If queue::flush() or queue::wait() are used with a deferred queue execution model, then I see two drawbacks compared to the current approach: 1) the graph will only ever be submitted once; 2) the operation will perform both the finalization work of creating backend objects and doing optimization, as well as the execution, so these concerns are still linked. Therefore, I think we'd still need a method on the queue equivalent to finalize() that can be passed properties for things like kernel fusion, and the returned handle would look like command_graph.

No, this is not what I had in mind. I was thinking more of something like:


auto exec_handle = std::make_shared<execution_handle>();

{
  // The property could also be tied to/referenced as forward-progress behavior
  // and also allow to define eager queues that are supposed to submit as early as possible
  sycl::queue q{sycl::property::execution_behavior{sycl::deferred, exec_handle}};

  // Submit operations here
  q.parallel_for(....);
  ...

  // For deferred execution, graph would be finalized in the destructor and available for
  // execution in exec_handle
}

// Can be invoked multiple times
exec_handle->run();

I'm also not sure if tying a graph to a single queue is a limitation. For a multi-device graph to be based on a single queue, then we'd also need an extension like HIPSYCL_EXT_MULTI_DEVICE_QUEUE to make an individual queue multi-device.

Oh yes. There's no reason why a SYCL queue should be tied to just one device, it's just a remnant from SYCL objects mapping 1:1 to OpenCL objects in SYCL 1.2.1.

illuhad avatar Apr 28 '23 13:04 illuhad

auto exec_handle = std::make_shared<execution_handle>();

{
  // The property could also be tied to/referenced as forward-progress behavior
  // and also allow to define eager queues that are supposed to submit as early as possible
  sycl::queue q{sycl::property::execution_behavior{sycl::deferred, exec_handle}};

  // Submit operations here
  q.parallel_for(....);
  ...

  // For deferred execution, graph would be finalized in the destructor and available for
  // execution in exec_handle
}

// Can be invoked multiple times
exec_handle->run();

Thanks for the code snippet, that's very helpful; I can see the flexibility that an execution behaviour property gives. My thoughts overall are:

  1. This proposal could be more difficult for libraries. For example, if a library API accepts a queue as parameter, then the library can't stop recording and start a new recording, because it would need to destroy the queue to stop the recording, and it can't return a new queue.

  2. We would need to have wording to specify a new execution model to accommodate run(), e.g. defining where the handle runs and how it synchronizes with other submissions. I think this is technically possible, but an observation that this would need careful consideration.

  3. Not sure if sub-graphs would be expressible. Right now, we can capture a child graph as a node of the parent graph by using handler::ext_oneapi_graph in a node being added to the graph.

EwanC avatar May 03 '23 08:05 EwanC

  1. This proposal could be more difficult for libraries. For example, if a library API accepts a queue as parameter, then the library can't stop recording and start a new recording, because it would need to destroy the queue to stop the recording, and it can't return a new queue.

This point seems easy to solve -- some libraries don't need that level of control over the queue they are passed. For example, oneMKL just adds a kernel to the queue and gives you back the event generated from that submission -- it doesn't care about the actual execution; it leaves that up to you (call event::wait on the event, pass the event to an h_depends, ignore the event and call queue::wait instead, whatever).

Any library that wants lifetime control of the queue itself should probably be creating its own queue (with whatever properties it wants) -- it could query the device and context (and property list) from a passed-in queue and use those or not, as it wishes.

Any library that doesn't want lifetime control of the queue itself, but does want to wait for completion of particular submissions to the queue, should be given the events for any significant submissions that occurred before the library call, and should keep hold of events generated from submissions made within the library call. The task graph is a DAG, so there cannot be any deadlocks arising from waiting for any single node (as long as wait is equivalent to queue::get_exec_handle::run_until(sycl::event) for a queue with the deferred property, i.e., as long as calling event::wait causes progress to start if it isn't started already). In such a situation, it would be nice if the library documented that it has side effects, i.e., that it starts execution progress, which might cause the entire pre-call graph to execute during the library call. Whether graph execution returns to "deferred" after the run_until or not is part of (2), the execution model, which is a harder nut to crack.

So, in addition to exec_handle::run() for the owner of a deferred queue, point (1) would need exec_handle::run_until(event) for borrowers of a deferred queue.

Points (2) and (3) are more tricksy.

Wee-Free-Scot avatar May 03 '23 12:05 Wee-Free-Scot

oneMKL just adds a kernel to the queue and gives you back the event generated from that submission -- it doesn't care about the actual execution

In fact, this can be more complex, even for MKL. Part of the library call might be launched only once and is not intended to be recorded. More complexity comes from temporary storage (e.g., USM device memory) used by the library. Without the ability to bind the lifetime of an allocation to a command graph (which is a known gap for USM in our current proposal), the library must ensure that all submitted work is completed before it can safely do the deallocation. This is not possible when we apply a lazy execution model like the proposed graph extension. These are not blocking issues, but they are arguments for a library to become graph-aware and do things differently when in recording mode.

reble avatar May 04 '23 16:05 reble

Precommit failure is a known issue (https://github.com/intel/llvm/issues/10380)

steffenlarsen avatar Jul 19 '23 07:07 steffenlarsen