common: verbose: asynchronous verbose mode for execution time tracking
Description
This PR proposes a proof of concept (PoC) for an asynchronous verbose mode that tracks kernel execution times without blocking and with minimal synchronization latency. In the existing verbose mode, retrieving kernel timings adds significant overhead because each GPU kernel execution must be synchronized and because the time is measured on the host. The asynchronous mode removes the synchronization overhead by using event callbacks to query execution timings. The prototype is implemented for the OpenCL GPU API, which provides kernel execution statistics for profiling.
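The callback-based approach described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: `mock_event_t`, `async_verbose_t`, and `track` are hypothetical names that stand in for a device event and the verbose logger. In a real OpenCL build, the corresponding steps would presumably be registering a completion callback with `clSetEventCallback` and reading `CL_PROFILING_COMMAND_START` / `CL_PROFILING_COMMAND_END` through `clGetEventProfilingInfo`, on a queue created with profiling enabled.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device event that carries profiling
// timestamps in nanoseconds and fires a callback on completion.
struct mock_event_t {
    std::uint64_t start_ns = 0;
    std::uint64_t end_ns = 0;
    std::function<void(const mock_event_t &)> on_complete;

    // Called by the (simulated) runtime once the kernel has finished.
    void complete(std::uint64_t start, std::uint64_t end) {
        start_ns = start;
        end_ns = end;
        if (on_complete) on_complete(*this);
    }
};

struct timing_record_t {
    std::string name;
    double exec_ms;
};

// Hypothetical asynchronous verbose logger: the host never waits on the
// event; timings are collected whenever the completion callback runs.
class async_verbose_t {
public:
    // Registers a completion callback and returns immediately.
    void track(std::string name, mock_event_t &event) {
        event.on_complete = [this, name = std::move(name)](
                                    const mock_event_t &ev) {
            // Device-side timestamps, so host scheduling does not
            // affect the measured duration.
            const double ms = static_cast<double>(ev.end_ns - ev.start_ns) / 1e6;
            // Callbacks may run on a runtime-owned thread, hence the lock.
            std::lock_guard<std::mutex> lock(mutex_);
            records_.push_back({name, ms});
        };
    }

    // Returns a snapshot of the timings collected so far.
    std::vector<timing_record_t> records() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return records_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<timing_record_t> records_;
};
```

In this sketch, `track` returns before any timing exists, and a record appears only after the event completes, which mirrors how the asynchronous mode avoids a host-side wait per kernel.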
The implementation will be added as experimental functionality, enabled at build time with DNNL_EXPERIMENTAL_ASYNC_VERBOSE:
cmake .. -DDNNL_EXPERIMENTAL=ON -DDNNL_EXPERIMENTAL_ASYNC_VERBOSE=ON -DDNNL_EXPERIMENTAL_PROFILING=ON -DDNNL_GPU_RUNTIME=OCL
Related RFC: [link]
Addresses MFDNN-13603.
Checklist
- [x] Have you published an RFC for the new feature?
- [ ] Was the RFC approved?
- [ ] Have you added relevant tests?