Prefix scan using warp primitives and a fallback for non-primitive types
@maleadt
I've been having some thoughts on this PR, mainly that it does not reflect the true definition of a prefix scan. A scan is supposed to work with any associative binary operator and does not require an identity element, whereas our version does require one. In our case we cannot get operators like xor, min, and max to work because no neutral element is available to the implementation for them.
Even outside the scope of Base.accumulate, our function should be just as capable as the Thrust version, which is flexible in this regard.
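For illustration, a minimal CPU-only sketch of the distinction (just Base, nothing GPU-specific): `accumulate` assumes nothing beyond an associative operator, while a kernel that pads partial tiles with a neutral element has to know that element for every operator/element-type pair.

```julia
x = [5, 3, 8, 1, 7]

accumulate(min, x)   # [5, 3, 3, 1, 1]
accumulate(max, x)   # [5, 5, 8, 8, 8]
accumulate(xor, x)   # [5, 6, 14, 15, 8] -- any associative operator works

# A kernel hard-wired to pad with a neutral element (op(neutral, a) == a)
# only supports (operator, eltype) pairs for which that element is known,
# e.g. + with zero(T); min/max/xor fall through unless their identities
# (typemax(T), typemin(T), zero(T)) are wired in as well.
```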
Yeah, the same argument can be made for mapreduce. Requiring a neutral element does make these reductions much easier to implement though, as many textbook GPU implementations use + and 0.
It no longer requires a neutral element, and it now also computes exclusive scans, which it couldn't before.
The new features need more tests, but I'm not sure what kind of tests to write. Should I write a CPU version and compare against that, or can I construct special cases and check for specific values?
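If a CPU reference is the way to go, something along these lines might work (a sketch only; the helper name and the `init` argument are made up here, not part of the PR):

```julia
# Hypothetical CPU reference for the exclusive variant: Base.accumulate has no
# exclusive mode, so compute it directly by emitting the running total *before*
# folding in the current element.
function exclusive_scan_reference(op, v::AbstractVector, init)
    out = similar(v)
    acc = init
    for i in eachindex(v)
        out[i] = acc
        acc = op(acc, v[i])
    end
    return out
end

exclusive_scan_reference(+, [1, 2, 3, 4], 0)  # == [0, 1, 3, 6]
```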
Fixed, no races detected anymore. The error was in my usage of @cuDynamicSharedMem.
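For context, a generic sketch of how @cuDynamicSharedMem gets used (not the actual kernel here); the usual ways it turns into a race are under-sizing the launch-time `shmem` or letting a second allocation overlap the first by omitting the byte offset:

```julia
using CUDA

# Generic sketch: two @cuDynamicSharedMem allocations carved out of one dynamic
# shared-memory block. The second allocation needs an explicit byte offset past
# the first, and the launch must pass enough `shmem` bytes for both.
function shmem_kernel(out)
    t = threadIdx().x
    a = @cuDynamicSharedMem(Float32, blockDim().x)                                   # bytes [0, 4n)
    b = @cuDynamicSharedMem(Float32, blockDim().x, blockDim().x * sizeof(Float32))   # bytes [4n, 8n)
    @inbounds a[t] = Float32(t)
    @inbounds b[t] = 2f0 * a[t]
    sync_threads()
    @inbounds out[t] = a[t] + b[t]
    return
end

n = 32
out = CUDA.zeros(Float32, n)
@cuda threads=n shmem=2*n*sizeof(Float32) shmem_kernel(out)
```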
# new version
julia> A = CUDA.ones(Int128, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.152922 seconds (978 CPU allocations: 25.609 KiB) (2 GPU allocations: 2.654 MiB, 0.14% gc time of which 94.00% spent allocating)
julia> A = CUDA.ones(Float32, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A);
0.036084 seconds (616 CPU allocations: 19.922 KiB) (2 GPU allocations: 381.848 KiB, 0.03% gc time)
# current master branch
julia> A = CUDA.ones(Int128, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.077795 seconds (743 CPU allocations: 24.125 KiB) (2 GPU allocations: 1.492 MiB, 0.35% gc time of which 95.43% spent allocating)
julia> A = CUDA.ones(Float32, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.044769 seconds (630 CPU allocations: 20.281 KiB) (2 GPU allocations: 381.848 KiB, 0.33% gc time of which 91.34% spent allocating)
The cost of sync_warp() is too high, yet it is essential here to establish a memory fence. Maybe the current Blelloch scan should remain the default fallback. This PR also needs additional tests for exclusive scan.
The reason for the poor performance turned out to be a missing @inbounds. The performance is now close to the Blelloch version, albeit about 7% slower.
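For anyone following along, a rough sketch of the kind of shared-memory warp scan in question (not the PR's actual kernel): sync_warp() is what provides the intra-warp fence between lanes reading and writing the shared cache, and @inbounds marks the kind of bounds-check elision referred to above. This is the path one would take for element types like Int128 that the warp shuffle intrinsics don't cover; combining per-warp results into a full scan is omitted.

```julia
using CUDA

# Generic sketch: per-warp Hillis–Steele inclusive scan through shared memory.
function warp_scan_kernel(out, inp::AbstractArray{T}) where {T}
    tid  = threadIdx().x
    lane = (tid - 1) % 32 + 1
    i    = (blockIdx().x - 1) * blockDim().x + tid
    cache = @cuDynamicSharedMem(T, blockDim().x)

    @inbounds cache[tid] = inp[i]       # @inbounds: GPU bounds checks are expensive
    sync_warp()                         # make the stores visible within the warp

    d = 1
    while d < 32
        @inbounds val = lane > d ? cache[tid - d] : zero(T)   # zero(T) is a dummy, never used
        sync_warp()                     # fence: all reads done before overwriting
        if lane > d
            @inbounds cache[tid] += val
        end
        sync_warp()                     # fence: writes visible to the next round
        d *= 2
    end

    @inbounds out[i] = cache[tid]
    return
end

A = CUDA.ones(Int128, 1024)
B = similar(A)
@cuda threads=256 blocks=4 shmem=256*sizeof(Int128) warp_scan_kernel(B, A)
```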
# current master
julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial:
memory estimate: 17.81 KiB
allocs estimate: 555
--------------
minimum time: 70.224 ms (0.00% GC)
median time: 72.690 ms (0.00% GC)
mean time: 73.293 ms (0.00% GC)
maximum time: 77.850 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1
# new implementation
julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial:
memory estimate: 20.39 KiB
allocs estimate: 646
--------------
minimum time: 77.801 ms (0.00% GC)
median time: 78.209 ms (0.00% GC)
mean time: 79.301 ms (0.00% GC)
maximum time: 85.532 ms (0.00% GC)
--------------
samples: 28
evals/sample: 1
Is it worth keeping the Blelloch version for this minor improvement?
How do I write tests for the exclusive version? Since accumulate does not have an exclusive parameter and the current tests compare against a Base algorithm, should I write a CPU version in test/array.jl and test against that?
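One possible shape for such a test, as a sketch only (the `exclusive` keyword is an assumption about the final API, and the CPU helper is the one sketched earlier in this thread):

```julia
using Test, CUDA

@testset "scan! exclusive" begin
    a  = rand(Int, 1000)
    da = CuArray(a)
    db = similar(da)

    # inclusive scan: compare directly against Base.accumulate
    CUDA.scan!(+, db, da; dims=1)
    @test Array(db) == accumulate(+, a)

    # exclusive scan: compare against the CPU reference from earlier, assuming
    # some keyword (spelled `exclusive` here) selects that mode:
    # CUDA.scan!(+, db, da; dims=1, exclusive=true)
    # @test Array(db) == exclusive_scan_reference(+, a, 0)
end
```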
It turns out I had made a mistake and removed the @inbounds from the aggregate_partial_scan function. With that restored, it's now performing much better than the current version in both cases.
julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial:
memory estimate: 20.27 KiB
allocs estimate: 638
--------------
minimum time: 47.482 ms (0.00% GC)
median time: 54.718 ms (0.00% GC)
mean time: 55.158 ms (0.00% GC) # 1.4x faster
maximum time: 59.364 ms (0.00% GC)
--------------
samples: 35
evals/sample: 1
julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Float32, 100_000_000); B = similar(A))
BenchmarkTools.Trial:
memory estimate: 19.80 KiB
allocs estimate: 608
--------------
minimum time: 17.683 ms (0.00% GC)
median time: 19.859 ms (0.00% GC)
mean time: 19.917 ms (0.00% GC) # 2x faster
maximum time: 22.436 ms (0.00% GC)
--------------
samples: 152
evals/sample: 1
I'm seeing smaller, but nonetheless nice improvements :-) I'm a bit wary that not all corner cases are adequately tested (e.g. exclusive=true), but we can improve that as we go. I've pushed some NFC improvements and a rebase; let's run a final test and merge this: bors r+
Actually, let's merge this manually when performance tests on master have finished.
bors r-
Canceled.
bors try
Well, that's interesting:
Test threw exception
Expression: testf((x->begin
accumulate(+, x)
end), rand(Int128, n))
CUDA error: an illegal instruction was encountered (code 715, ERROR_ILLEGAL_INSTRUCTION)
@Ellipse0934 did you run into any of these?
No, I tested on this branch again to be sure and all tests have passed successfully. My tests: https://gist.github.com/Ellipse0934/2db29bc0602641c4cccd58aadb64fd93
Well, there's definitely something up, although it looks like it only manifests on recent hardware (sm_70 or higher). Looks like https://github.com/kokkos/kokkos/issues/1958, with cuda-memcheck pointing to the warp sync:
array (2) | started at 2020-08-14T11:04:40.734
From worker 2: ========= CUDA-MEMCHECK
From worker 2: ========= Illegal Instruction
From worker 2: ========= at 0x00000030 in __cuda_sm70_warpsync
From worker 2: ========= by thread (31,0,0) in block (0,0,0)
From worker 2: ========= Device Frame:/home/tbesard/Julia/pkg/LLVM/src/interop/base.jl:53:_Z25julia_partial_scan__205832__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE (_Z25julia_partial_scan__205832__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE : 0x3f60)
Running with synccheck:
From worker 2: ========= Barrier error detected. Invalid arguments
From worker 2: ========= at 0x00000030 in __cuda_sm70_warpsync
From worker 2: ========= by thread (31,0,0) in block (0,0,0)
From worker 2: ========= Device Frame:__cuda_sm70_warpsync (__cuda_sm70_warpsync : 0x30)
From worker 2: ========= Device Frame:/home/tbesard/Julia/pkg/LLVM/src/interop/base.jl:53:_Z25julia_partial_scan__205822__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE (_Z25julia_partial_scan__205822__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE : 0x3f60)
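For reference, the kind of pattern that typically produces this on sm_70+ with independent thread scheduling, as in the linked kokkos issue: part of a warp skips a sync_warp() whose default full mask still names those lanes. A generic illustration in CUDA.jl, not necessarily the exact bug here:

```julia
using CUDA

# Problematic pattern: only some lanes reach the barrier, but the default full
# mask of sync_warp() names the whole warp. On sm_70+ this can surface as the
# barrier/illegal-instruction errors reported above.
function bad_kernel(out, n)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds out[i] = Float32(i)
        sync_warp()              # lanes with i > n never arrive here
    end
    return
end

# Safer pattern: keep the divergent work inside the branch, but let every lane
# of the warp converge on the same sync_warp().
function good_kernel(out, n)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds out[i] = Float32(i)
    end
    sync_warp()
    return
end

n   = 100
out = CUDA.zeros(Float32, 128)
@cuda threads=128 good_kernel(out, n)
```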
I'd love to try this PR out, as findall is my current bottleneck. Are there any plans to resolve the conflicts? I have a variety of GPUs I could help test with (V100, 2080 Ti, RTX 8000, M6000, 980 Ti, 1080 Ti).
Rebased. But this needs some fixes due to the synchronization issues.