oneDNN icon indicating copy to clipboard operation
oneDNN copied to clipboard

[GPU] a non-blocking, profiler-based verbose mode for execution time tracking

Open avmanerikar opened this issue 3 months ago • 5 comments

Description

The PR implements a non-blocking verbose profiling to record and print primitive execution times without relying on blocking, host-side stream.wait() calls. The implementation is aimed at triaging two issues associated with the previous blocked approach and highlighted in MFDNN-12088: (i) timing inaccuracies arising from host-to-device synchronization overheads and (ii) performance discrepancies related to blocking stream.wait() calls.

Following the discussions in the RFC (PR #3393) and the PoC implementation (PR #3055), the PR implements the non-blocking verbose by leveraging the stream_profiler functionality to obtain device measured times for primitive execution and using asynchronous callback to print profiling info in a non-blocking manner. The implementation proposes the following changes to the verbose profiling:

  • enable stream profiling capabilities when DNNL_VERBOSE=profile_exec for GPU streams.
  • calculate primitive execution times by aggregating the timing data obtained from the stream profiler at the end of primtive execution.
  • use asynchronous callbacks to print verbose profiling info upon output event completion.

Addresses MFDNN-12088.

Considerations

  • The non-blocking verbose mode is enabled for OpenCL and SYCL GPU streams only, defaulting to a blocked implementation for CPU streams and other runtimes.
  • The non-blocking verbose mode is not supported for out-of-order queues and falls back to the blocked mode for such cases.
  • The asynchronous mode can be user-disabled using DNNL_DISABLE_ASYNC_VERBOSE to default to a blocking implementation for all cases.
DNNL_VERBOSE=profile_exec DNNL_DISABLE_ASYNC_VERBOSE=1 ./examples/primitives-matmul-cpp gpu

Related RFC: [link] Related PoC and discussion: [link]

RFC Checklist:

  • [x] Have you published an RFC for the new feature?
  • [ ] Was the RFC approved?
  • [ ] Have you added relevant tests?

avmanerikar avatar Oct 22 '25 17:10 avmanerikar

Because benchdnn also uses the stream_profiler functionality in perf mode, the verbose profiler is disabled when benchdnn is run in perf mode to record aggregate timing info.

There is a similar conflict with profiling API: https://uxlfoundation.github.io/oneDNN/dev_guide_experimental.html#onednn-experimental-profiling making usage of verbose functionality and profiling API exclusive.

I think this could be avoided by introducing a new stream_profiler_t object (dedicated only for verbose-related timing) and extending get_profiling_data() and similar APIs. This seems to be a better option to me as we will be able to keep benchmarking, verbose functionality and profiling API working independently, without extra cross-conditions.

@avmanerikar Do you see any issues with that?

echeresh avatar Oct 24 '25 22:10 echeresh

@avmanerikar Do you see any issues with that?

The trouble with having multiple profiler objects is that they will be tracking the same information and utilizing the same APIs throughout the stream lifetime. The conflicting behavior was due to the verbose profiling data being reset after primitive execution leaving an empty slate for profiling queries post-execution. I have adjusted the design to removing the resetting action which allows it to be used with the benchdnn perf mode and the experimental profiler.

avmanerikar avatar Oct 27 '25 17:10 avmanerikar

I have adjusted the design to removing the resetting action which allows it to be used with the benchdnn perf mode and the experimental profiler.

@avmanerikar Thanks for the update. Is concurrent access from the event callback and from profiling API a concern here?

echeresh avatar Oct 28 '25 00:10 echeresh

@avmanerikar Thanks for the update. Is concurrent access from the event callback and from profiling API a concern here?

This would be a concern if the profiler is reset by the experimental API while the verbose profiler is yet to print the timing data. But since the verbose profiler calculates and prints the timing data during the course of primitive execution, this is unlikely to happen.

avmanerikar avatar Oct 29 '25 18:10 avmanerikar

This would be a concern if the profiler is reset by the experimental API while the verbose profiler is yet to print the timing data. But since the verbose profiler calculates and prints the timing data during the course of primitive execution, this is unlikely to happen.

That makes sense. Alas, such rare issues are usually the hardest to debug. Can we somehow make this scenario to "never happen" instead of "unlikely to happen"?

echeresh avatar Oct 29 '25 20:10 echeresh

make test disable benchdnn_all set test_scope=NIGHTLY disable test_device_cpu enable test_device_gpu

avmanerikar avatar Dec 17 '25 00:12 avmanerikar