
compare reduce_by_segment using oneDPL and Thrust


I tried to call the oneDPL (with USM device pointers) and Thrust functions for segment reduction. The pointers (d_keys, d_in, d_keys_out, and d_out) all point to device memory.

    for (int i = 0; i < repeat; i++)
      oneapi::dpl::reduce_by_segment(policy, d_keys, d_keys + num_elements, d_in,
                            d_keys_out, d_out);


    for (int i = 0; i < repeat; i++)
      thrust::reduce_by_key(thrust::device, d_keys, d_keys + num_elements, d_in,
                            d_keys_out, d_out);

I measured the execution time of the oneDPL and Thrust loops. The performance results on an NVIDIA V100 GPU show a significant difference. If this is not what you observe, please let me know. Thank you.

oneDPL:

    num_elements = 268435456
    num_segments = 16777216 segment_size = 16 Throughput = 2.070432 (G/s)
    num_segments = 8388608 segment_size = 32 Throughput = 2.084561 (G/s)
    num_segments = 4194304 segment_size = 64 Throughput = 2.091851 (G/s)
    num_segments = 2097152 segment_size = 128 Throughput = 2.094333 (G/s)
    num_segments = 1048576 segment_size = 256 Throughput = 2.089707 (G/s)
    num_segments = 524288 segment_size = 512 Throughput = 2.094500 (G/s)
    num_segments = 262144 segment_size = 1024 Throughput = 2.097956 (G/s)
    num_segments = 131072 segment_size = 2048 Throughput = 2.099221 (G/s)
    num_segments = 65536 segment_size = 4096 Throughput = 2.095594 (G/s)
    num_segments = 32768 segment_size = 8192 Throughput = 2.100392 (G/s)
    num_segments = 16384 segment_size = 16384 Throughput = 2.100950 (G/s)

Thrust:

    num_elements = 268435456
    num_segments = 16777216 segment_size = 16 Throughput = 65.469612 (G/s)
    num_segments = 8388608 segment_size = 32 Throughput = 71.512650 (G/s)
    num_segments = 4194304 segment_size = 64 Throughput = 77.745247 (G/s)
    num_segments = 2097152 segment_size = 128 Throughput = 80.560936 (G/s)
    num_segments = 1048576 segment_size = 256 Throughput = 80.120583 (G/s)
    num_segments = 524288 segment_size = 512 Throughput = 83.196640 (G/s)
    num_segments = 262144 segment_size = 1024 Throughput = 81.201439 (G/s)
    num_segments = 131072 segment_size = 2048 Throughput = 79.137405 (G/s)
    num_segments = 65536 segment_size = 4096 Throughput = 84.549011 (G/s)
    num_segments = 32768 segment_size = 8192 Throughput = 82.928703 (G/s)
    num_segments = 16384 segment_size = 16384 Throughput = 95.465546 (G/s)

zjin-lcf (Sep 12, 2022)

Hi,

Thank you for the data. We've recently worked to improve reduce_by_segment performance; that work is available in https://github.com/oneapi-src/oneDPL/pull/608. While it is pending review and merge to main, would you please take a look at it to see if it addresses the difference you're seeing?

timmiesmith (Sep 12, 2022)

@zjin-lcf Do you remember if the oneDPL runs were on a V100? I was seeing some build issues with the CUDA backend (for complex types, reported here) and just wanted to check with you.

abagusetty (Feb 13, 2023)

Yes, on a V100. However, I didn't evaluate the performance of the functions for complex types. Is there a reproducer for https://github.com/intel/llvm/issues/8281?

zjin-lcf (Feb 13, 2023)

I was hitting intel/llvm#8281 when building oneDPL itself for the CUDA backend, not from a test case.

abagusetty (Feb 13, 2023)

Did you clone the oneDPL repo and then specify `clang++ -I ./oneDPL/include -I./oneTBB/include ...`?

zjin-lcf (Feb 13, 2023)

> Did you clone the oneDPL repo and then specify `clang++ -I ./oneDPL/include -I./oneTBB/include ...`?

I was actually referring to building the oneDPL repo itself and running the unit tests inside the repo, without the TBB backend (just serial):

    cmake .. -DCMAKE_CXX_COMPILER=clang++ -DONEDPL_BACKEND=dpcpp_only -DCMAKE_BUILD_TYPE=Release -DONEDPL_USE_UNNAMED_LAMBDA=ON -DCMAKE_INSTALL_PREFIX=$PWD/../install_oneapi_PrgEnvgnu -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -fmath-errno -ffast-math"

abagusetty (Feb 13, 2023)

@timmiesmith

Just an update after rerunning the example: the throughput now reaches ~10.2 G/s.

zjin-lcf (Jun 8, 2023)

@zjin-lcf would you please let me know which commit of oneDPL you're using? #862 was merged recently to improve reduce_by_segment performance. This is still an algorithm we're working to improve, and I want to confirm that the improvement you're seeing is from the recent PR merge.

timmiesmith (Jun 9, 2023)

In the oneDPL directory, `git log` shows:

    commit c697fac0b51ce2a36f3824bb9063dfaf6aee88ac (HEAD -> main, origin/release/2022.2, origin/main, origin/HEAD)
    Author: Dan Hoeflinger <[email protected]>
    Date:   Tue Jun 6 14:02:14 2023 -0400

Thanks.

zjin-lcf (Jun 9, 2023)

Thank you. This does include the recent reduce_by_segment performance improvements.

timmiesmith (Jun 9, 2023)