performance issue in the dpct program for stream create, copy, destroy

Open jinz2014 opened this issue 1 year ago • 10 comments

There is a performance gap between the CUDA and DPCT programs when they run on an NVIDIA GPU (e.g., a 3090). Thanks for your review.

DPCT program: https://github.com/zjin-lcf/HeCBench/blob/master/src/streamCreateCopyDestroy-sycl/main.cpp
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 17.0295 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 17.0324 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 17.0522 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 16.9666 (ms)

CUDA program: https://github.com/zjin-lcf/HeCBench/blob/master/src/streamCreateCopyDestroy-cuda/main.cu
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 12.8682 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 12.799 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 12.815 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 12.8682 (ms)

jinz2014, Apr 30 '24 16:04

@jinz2014 After further investigation, the performance gap appears to be introduced by the SYCL runtime. I suggest reporting this issue to the SYCL runtime maintainers at https://github.com/intel/llvm/issues.

tomflinda, May 08 '24 00:05

There is an update from https://github.com/oneapi-src/unified-runtime/issues/1589.

jinz2014, May 15 '24 18:05

@jinz2014

  1. DPCT/SYCLomatic queues for USM already use in-order queues.
  2. Discard events cannot be enabled in the DPCT/SYCLomatic queues for USM. According to the definition of sycl_ext_oneapi_discard_queue_events, the extension adds the ext::oneapi::property::queue::discard_events property for sycl::queue; by using this property, the application informs a SYCL implementation that it will not use the event returned by any of the queue member functions (i.e., submit, parallel_for, copy, memset and others). DPCT/SYCLomatic, however, needs the events returned by the queue member functions to perform synchronization in the migrated code.

tomflinda, May 21 '24 23:05

I ran the SYCL program with and without the discard_events queue property on an NVIDIA GPU. The property does not seem to have an impact on performance. You can run the SYCL (main.cpp) and DPCT (main-dpct.cpp) programs to compare performance.

https://github.com/zjin-lcf/HeCBench/tree/master/src/streamCreateCopyDestroy-sycl

jinz2014, May 23 '24 02:05

Hi @jinz2014, I have reproduced your performance result, which shows no difference between SYCL (main.cpp) and DPCT (main-dpct.cpp). So, in summary, the performance gap is introduced by the SYCL runtime.

tomflinda, Jun 05 '24 00:06

Let me clarify my comments. The execution time of 'main.cpp' is very close to that of the CUDA program 'main.cu'. However, the execution time of 'main-dpct.cpp' is not. Regardless of whether discard_events is used, there is a performance gap between main.cpp and main-dpct.cpp. I am not sure if this is introduced by the SYCL runtime.

jinz2014, Jun 05 '24 14:06

@jinz2014 Based on my experiment results, the performance gain in main.cpp comes mainly from the use of sycl::ext::oneapi::property::queue::discard_events() in https://github.com/zjin-lcf/HeCBench/blob/88c5bcb94f68003ba304b6d373f9a4641dee5e2e/src/streamCreateCopyDestroy-sycl/main.cpp#L78.

If we remove sycl::ext::oneapi::property::queue::discard_events() from main.cpp, there is no performance difference between main.cpp and main-dpct.cpp.

Could you try to rerun the test after removing sycl::ext::oneapi::property::queue::discard_events() in https://github.com/zjin-lcf/HeCBench/blob/88c5bcb94f68003ba304b6d373f9a4641dee5e2e/src/streamCreateCopyDestroy-sycl/main.cpp#L78?

If there is still a performance gap after sycl::ext::oneapi::property::queue::discard_events() is removed from main.cpp, please provide details of your test environment so that we can reproduce the issue in the same environment and investigate further.

tomflinda, Jun 17 '24 05:06

On a V100 GPU, the version of the CUDA compilation tools is 12.3.

The results of the CUDA program:
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 15.6444 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 15.6255 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 15.6548 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 15.6429 (ms)

The SYCL program is built with oneAPI 2024.2 and the prebuilt CUDA plugin from Codeplay.

The results of the SYCL program with the "discard_events" option:
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 23.9135 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 23.9091 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 23.8957 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 23.8987 (ms)

The results of the SYCL program without the "discard_events" option:
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 23.9317 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 23.9154 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 23.9224 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 23.9258 (ms)

jinz2014, Jul 11 '24 15:07

@jinz2014 According to the description of discard_events in https://github.com/oneapi-src/unified-runtime/issues/1589 and our test result mentioned above, discard_events should impact performance. The only difference between your environment and mine is the CUDA plugin. Could you try building and installing the open-source version from https://github.com/intel/llvm?

tomflinda, Jul 15 '24 01:07

There is a performance impact in the open-source version.

jinz2014, Jul 17 '24 18:07

Hi @jinz2014, since discard_events does impact performance with the open-source version, I suggest you report this issue to the Codeplay team at https://developer.codeplay.com/products/oneapi/nvidia/feedback/.

tomflinda, Sep 13 '24 03:09