Performance issue in the DPCT program for stream create, copy, destroy
There is a performance gap between the CUDA and DPCT programs when they run on an NVIDIA GPU (e.g. 3090). Thanks for your review.
DPCT program: https://github.com/zjin-lcf/HeCBench/blob/master/src/streamCreateCopyDestroy-sycl/main.cpp
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 17.0295 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 17.0324 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 17.0522 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 16.9666 (ms)
CUDA program: https://github.com/zjin-lcf/HeCBench/blob/master/src/streamCreateCopyDestroy-cuda/main.cu
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 12.8682 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 12.799 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 12.815 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 12.8682 (ms)
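To put the gap in relative terms, the reported timings can be compared directly. This is only a minimal sketch: the helper names `slowdown_percent` and `print_one_stream_gap` are made up for illustration and are not part of HeCBench, and the two constants are the 1-stream timings quoted above.

```cpp
#include <cstdio>

// Hypothetical helper (not from HeCBench): percent by which
// candidate_ms exceeds baseline_ms.
double slowdown_percent(double baseline_ms, double candidate_ms) {
  return (candidate_ms / baseline_ms - 1.0) * 100.0;
}

// Prints the relative gap for the 1-stream case reported in this issue.
void print_one_stream_gap() {
  const double cuda_ms = 12.8682;  // CUDA program, 1 stream
  const double dpct_ms = 17.0295;  // DPCT program, 1 stream
  std::printf("DPCT is %.1f%% slower than CUDA\n",
              slowdown_percent(cuda_ms, dpct_ms));
}
```

For the 1-stream case this works out to roughly 32%, and the other stream counts give similar ratios since the timings barely change.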
@jinz2014 After further investigation, the performance gap appears to be introduced by the SYCL runtime. I suggest reporting this issue to the SYCL runtime project at https://github.com/intel/llvm/issues.
There is an update from https://github.com/oneapi-src/unified-runtime/issues/1589.
@jinz2014
- DPCT/SYCLomatic queues for USM already use an in-order queue.
- The discard_events property cannot be enabled in the DPCT/SYCLomatic queues for USM, because of how sycl_ext_oneapi_discard_queue_events is defined:
"This extension adds ext::oneapi::property::queue::discard_events property for sycl::queue, by using this property the application informs a SYCL implementation that it will not use the event returned by any of the queue member functions. (i.e submit, parallel_for, copy, memset and others.)"
DPCT/SYCLomatic, however, needs the events returned by the queue member functions to perform synchronization in the migrated code.
I ran the SYCL program with and without the discard_events queue property on an NVIDIA GPU. The property does not seem to have an impact on performance. You may run the SYCL (main.cpp) and DPCT (main-dpct.cpp) programs for a performance comparison.
https://github.com/zjin-lcf/HeCBench/tree/master/src/streamCreateCopyDestroy-sycl
Hi @jinz2014, I have reproduced your performance result, which shows no difference between SYCL (main.cpp) and DPCT (main-dpct.cpp). So, in summary, the performance gap is introduced by the SYCL runtime.
Let me clarify my comments. The execution time of 'main.cpp' is very close to that of the CUDA program 'main.cu'. However, the execution time of 'main-dpct.cpp' is not. Regardless of the discard event, there is a performance gap between main.cpp and main-dpct.cpp. I am not sure if this is introduced by the SYCL runtime.
@jinz2014
Based on my experiment results:
The performance gain in main.cpp comes mainly from the use of sycl::ext::oneapi::property::queue::discard_events() at https://github.com/zjin-lcf/HeCBench/blob/88c5bcb94f68003ba304b6d373f9a4641dee5e2e/src/streamCreateCopyDestroy-sycl/main.cpp#L78.
If we remove sycl::ext::oneapi::property::queue::discard_events() from main.cpp, the performance difference disappears.
Could you rerun the test after removing sycl::ext::oneapi::property::queue::discard_events() at https://github.com/zjin-lcf/HeCBench/blob/88c5bcb94f68003ba304b6d373f9a4641dee5e2e/src/streamCreateCopyDestroy-sycl/main.cpp#L78?
If there is still a performance gap after sycl::ext::oneapi::property::queue::discard_events() is removed from main.cpp, please provide your test environment details so that we can reproduce it in the same environment and investigate further.
On a V100 GPU, the version of the CUDA compilation tools is 12.3:
The results of the CUDA program:
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 15.6444 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 15.6255 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 15.6548 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 15.6429 (ms)
The SYCL program is built with oneAPI 2024.2 and the prebuilt CUDA plugin from Codeplay.
The results of the SYCL program with the "discard_events" option:
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 23.9135 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 23.9091 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 23.8957 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 23.8987 (ms)
The results of the SYCL program without the "discard_events" option:
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 23.9317 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 23.9154 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 23.9224 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 23.9258 (ms)
@jinz2014 According to the description of discard_events in https://github.com/oneapi-src/unified-runtime/issues/1589 and our test results mentioned above, discard_events should impact performance. The only difference between your environment and mine is the CUDA plugin. Could you try building and installing the open-source version from https://github.com/intel/llvm?
There is a performance impact in the open-source version.
Hi @jinz2014, since discard_events does impact performance with the open-source version, I suggest you report this issue to the Codeplay team at https://developer.codeplay.com/products/oneapi/nvidia/feedback/.