Prefix scan using warp primitives and a fallback for non-primitive types
@maleadt
I've been having some thoughts on this PR, mainly that it does not reflect the true definition of a prefix scan. A scan is supposed to work with any associative binary operator and does not require an identity element, whereas our version does require one. In our case we cannot get operators like xor, min, and max to work because no neutral element is available to the implementation for them.
Even outside the scope of Base.accumulate, our function should be just as capable as the Thrust version, which is flexible in this regard.
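For illustration, a minimal CPU-only sketch of the distinction (just Base, nothing GPU-specific): `accumulate` assumes nothing beyond an associative operator, while a kernel that pads partial tiles with a neutral element has to know that element for every operator/element-type pair.

```julia
x = [5, 3, 8, 1, 7]

accumulate(min, x)   # [5, 3, 3, 1, 1]
accumulate(max, x)   # [5, 5, 8, 8, 8]
accumulate(xor, x)   # [5, 6, 14, 15, 8] -- any associative operator works

# A kernel hard-wired to pad with a neutral element (op(neutral, a) == a)
# only supports (operator, eltype) pairs for which that element is known,
# e.g. + with zero(T); min/max/xor fall through unless their identities
# (typemax(T), typemin(T), zero(T)) are wired in as well.
```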
Yeah, the same argument can be made for mapreduce. Requiring a neutral element does make these reductions much easier to implement though, as many textbook GPU implementations use + and 0.
It no longer requires a neutral element, and it now also computes exclusive scans, which it couldn't before.
The new features need more tests, but I'm not sure what kind of tests to write. Should I write a CPU version and compare against that, or can I construct special cases and check for specific values?
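If a CPU reference is the way to go, something along these lines might work (a sketch only; the helper name and the `init` argument are made up here, not part of the PR):

```julia
# Hypothetical CPU reference for the exclusive variant: Base.accumulate has no
# exclusive mode, so compute it directly by emitting the running total *before*
# folding in the current element.
function exclusive_scan_reference(op, v::AbstractVector, init)
    out = similar(v)
    acc = init
    for i in eachindex(v)
        out[i] = acc
        acc = op(acc, v[i])
    end
    return out
end

exclusive_scan_reference(+, [1, 2, 3, 4], 0)  # == [0, 1, 3, 6]
```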
Fixed, no races detected anymore. The error was in my usage of @cuDynamicSharedMem.
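For context, a generic sketch of how @cuDynamicSharedMem gets used (not the actual kernel here); the usual ways it turns into a race are under-sizing the launch-time `shmem` or letting a second allocation overlap the first by omitting the byte offset:

```julia
using CUDA

# Generic sketch: two @cuDynamicSharedMem allocations carved out of one dynamic
# shared-memory block. The second allocation needs an explicit byte offset past
# the first, and the launch must pass enough `shmem` bytes for both.
function shmem_kernel(out)
    t = threadIdx().x
    a = @cuDynamicSharedMem(Float32, blockDim().x)                                   # bytes [0, 4n)
    b = @cuDynamicSharedMem(Float32, blockDim().x, blockDim().x * sizeof(Float32))   # bytes [4n, 8n)
    @inbounds a[t] = Float32(t)
    @inbounds b[t] = 2f0 * a[t]
    sync_threads()
    @inbounds out[t] = a[t] + b[t]
    return
end

n = 32
out = CUDA.zeros(Float32, n)
@cuda threads=n shmem=2*n*sizeof(Float32) shmem_kernel(out)
```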
# new version
julia> A = CUDA.ones(Int128, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.152922 seconds (978 CPU allocations: 25.609 KiB) (2 GPU allocations: 2.654 MiB, 0.14% gc time of which 94.00% spent allocating)
julia> A = CUDA.ones(Float32, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A);
0.036084 seconds (616 CPU allocations: 19.922 KiB) (2 GPU allocations: 381.848 KiB, 0.03% gc time)
# current master branch
julia> A = CUDA.ones(Int128, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.077795 seconds (743 CPU allocations: 24.125 KiB) (2 GPU allocations: 1.492 MiB, 0.35% gc time of which 95.43% spent allocating)
julia> A = CUDA.ones(Float32, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.044769 seconds (630 CPU allocations: 20.281 KiB) (2 GPU allocations: 381.848 KiB, 0.33% gc time of which 91.34% spent allocating)
The cost of sync_warp() is too high, yet it is essential here to establish a memory fence. Maybe the current Blelloch scan should remain the default fallback. This PR also needs additional tests for exclusive scan.
The reason for the poor performance turned out to be a missing @inbounds. The performance is now close to the Blelloch version, albeit about 7% slower.
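For anyone following along, a rough sketch of the kind of shared-memory warp scan in question (not the PR's actual kernel): sync_warp() is what provides the intra-warp fence between lanes reading and writing the shared cache, and @inbounds marks the kind of bounds-check elision referred to above. This is the path one would take for element types like Int128 that the warp shuffle intrinsics don't cover; combining per-warp results into a full scan is omitted.

```julia
using CUDA

# Generic sketch: per-warp Hillis–Steele inclusive scan through shared memory.
function warp_scan_kernel(out, inp::AbstractArray{T}) where {T}
    tid  = threadIdx().x
    lane = (tid - 1) % 32 + 1
    i    = (blockIdx().x - 1) * blockDim().x + tid
    cache = @cuDynamicSharedMem(T, blockDim().x)

    @inbounds cache[tid] = inp[i]       # @inbounds: GPU bounds checks are expensive
    sync_warp()                         # make the stores visible within the warp

    d = 1
    while d < 32
        @inbounds val = lane > d ? cache[tid - d] : zero(T)   # zero(T) is a dummy, never used
        sync_warp()                     # fence: all reads done before overwriting
        if lane > d
            @inbounds cache[tid] += val
        end
        sync_warp()                     # fence: writes visible to the next round
        d *= 2
    end

    @inbounds out[i] = cache[tid]
    return
end

A = CUDA.ones(Int128, 1024)
B = similar(A)
@cuda threads=256 blocks=4 shmem=256*sizeof(Int128) warp_scan_kernel(B, A)
```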
# current master
julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial:
memory estimate: 17.81 KiB
allocs estimate: 555
--------------
minimum time: 70.224 ms (0.00% GC)
median time: 72.690 ms (0.00% GC)
mean time: 73.293 ms (0.00% GC)
maximum time: 77.850 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1
# new implementation
julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial:
memory estimate: 20.39 KiB
allocs estimate: 646
--------------
minimum time: 77.801 ms (0.00% GC)
median time: 78.209 ms (0.00% GC)
mean time: 79.301 ms (0.00% GC)
maximum time: 85.532 ms (0.00% GC)
--------------
samples: 28
evals/sample: 1
Is it worth keeping the Blelloch version for this minor improvement?
How do I write tests for the exclusive version? Since accumulate does not have an exclusive parameter and the current tests compare against a Base algorithm, should I write a CPU version in test/array.jl and test against that?
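One possible shape for such a test, as a sketch only (the `exclusive` keyword is an assumption about the final API, and the CPU helper is the one sketched earlier in this thread):

```julia
using Test, CUDA

@testset "scan! exclusive" begin
    a  = rand(Int, 1000)
    da = CuArray(a)
    db = similar(da)

    # inclusive scan: compare directly against Base.accumulate
    CUDA.scan!(+, db, da; dims=1)
    @test Array(db) == accumulate(+, a)

    # exclusive scan: compare against the CPU reference from earlier, assuming
    # some keyword (spelled `exclusive` here) selects that mode:
    # CUDA.scan!(+, db, da; dims=1, exclusive=true)
    # @test Array(db) == exclusive_scan_reference(+, a, 0)
end
```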
It turns out I had made a mistake and removed the @inbounds from the aggregate_partial_scan function. With that restored, it's now performing much better than the current version in both cases.
julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial:
memory estimate: 20.27 KiB
allocs estimate: 638
--------------
minimum time: 47.482 ms (0.00% GC)
median time: 54.718 ms (0.00% GC)
mean time: 55.158 ms (0.00% GC) # 1.4x faster
maximum time: 59.364 ms (0.00% GC)
--------------
samples: 35
evals/sample: 1
julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Float32, 100_000_000); B = similar(A))
BenchmarkTools.Trial:
memory estimate: 19.80 KiB
allocs estimate: 608
--------------
minimum time: 17.683 ms (0.00% GC)
median time: 19.859 ms (0.00% GC)
mean time: 19.917 ms (0.00% GC) # 2x faster
maximum time: 22.436 ms (0.00% GC)
--------------
samples: 152
evals/sample: 1
I'm seeing smaller, but nonetheless nice improvements :-) I'm a bit wary that not all corner cases are adequately tested (e.g. exclusive=true), but we can improve that as we go. I've pushed some NFC improvements and a rebase; let's run a final test and merge this: bors r+
Actually, let's merge this manually when performance tests on master have finished.
bors r-
Canceled.
bors try
Well, that's interesting:
Test threw exception
Expression: testf((x->begin
accumulate(+, x)
end), rand(Int128, n))
CUDA error: an illegal instruction was encountered (code 715, ERROR_ILLEGAL_INSTRUCTION)
@Ellipse0934 did you run into any of these?
No, I tested on this branch again to be sure and all tests have passed successfully. My tests: https://gist.github.com/Ellipse0934/2db29bc0602641c4cccd58aadb64fd93
Well, there's definitely something up, although it looks like it only manifests on recent hardware (sm_70 or higher). Looks like https://github.com/kokkos/kokkos/issues/1958, with cuda-memcheck pointing to the warp sync:
array (2) | started at 2020-08-14T11:04:40.734
From worker 2: ========= CUDA-MEMCHECK
From worker 2: ========= Illegal Instruction
From worker 2: ========= at 0x00000030 in __cuda_sm70_warpsync
From worker 2: ========= by thread (31,0,0) in block (0,0,0)
From worker 2: ========= Device Frame:/home/tbesard/Julia/pkg/LLVM/src/interop/base.jl:53:_Z25julia_partial_scan__205832__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE (_Z25julia_partial_scan__205832__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE : 0x3f60)
Running with synccheck:
From worker 2: ========= Barrier error detected. Invalid arguments
From worker 2: ========= at 0x00000030 in __cuda_sm70_warpsync
From worker 2: ========= by thread (31,0,0) in block (0,0,0)
From worker 2: ========= Device Frame:__cuda_sm70_warpsync (__cuda_sm70_warpsync : 0x30)
From worker 2: ========= Device Frame:/home/tbesard/Julia/pkg/LLVM/src/interop/base.jl:53:_Z25julia_partial_scan__205822__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE (_Z25julia_partial_scan__205822__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE : 0x3f60)
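For reference, the kind of pattern that typically produces this on sm_70+ with independent thread scheduling, as in the linked kokkos issue: part of a warp skips a sync_warp() whose default full mask still names those lanes. A generic illustration in CUDA.jl, not necessarily the exact bug here:

```julia
using CUDA

# Problematic pattern: only some lanes reach the barrier, but the default full
# mask of sync_warp() names the whole warp. On sm_70+ this can surface as the
# barrier/illegal-instruction errors reported above.
function bad_kernel(out, n)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds out[i] = Float32(i)
        sync_warp()              # lanes with i > n never arrive here
    end
    return
end

# Safer pattern: keep the divergent work inside the branch, but let every lane
# of the warp converge on the same sync_warp().
function good_kernel(out, n)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds out[i] = Float32(i)
    end
    sync_warp()
    return
end

n   = 100
out = CUDA.zeros(Float32, 128)
@cuda threads=128 good_kernel(out, n)
```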
I'd love to try this PR out, as findall is my current bottleneck. Are there any plans to resolve the conflicts? I have a variety of GPUs I could help test with (V100, 2080 Ti, RTX 8000, M6000, 980 Ti, 1080 Ti).
Rebased. But this needs some fixes due to the synchronization issues.