Weird behaviour of mean function in CuArray
Describe the bug
When using a CuArray, I sometimes get weird behavior from some Flux normalization layers. Trying to find the problem, I saw that this happens (at least) when using the Statistics mean function. If I execute it on the CPU it always gives the expected values, but on the GPU it sometimes shows this strange behavior.
To reproduce
A minimal working example (MWE) for this bug:
julia> using CUDA, Statistics
julia> a = CUDA.ones((640, 640, 32, 1))
julia> b = mean(a; dims=[1])
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
;;; …
[:, :, 32, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
julia> b = mean(a |> cpu; dims=[1])
1×640×32×1 Array{Float32, 4}:
[:, :, 1, 1] =
 1.0  1.0  1.0  1.0  …  1.0  1.0  1.0
;;; …
[:, :, 32, 1] =
 1.0  1.0  1.0  1.0  …  1.0  1.0  1.0
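For reference, a compact way to quantify the discrepancy without printing whole arrays (a sketch; b_gpu and b_cpu are names introduced here, and the expected gap of roughly 6.3f-6 is just 1 minus the 0.9999937 value reported further down in this thread):

b_gpu = Array(mean(a; dims=1))   # bring the GPU result back to the host
b_cpu = mean(Array(a); dims=1)   # same reduction on the CPU
maximum(abs.(b_gpu .- b_cpu))    # ≈ 6.3f-6 here, instead of the expected 0.0f0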
Manifest.toml
[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Preferences", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions"]
git-tree-sha1 = "edff14c60784c8f7191a62a23b15a421185bc8a8"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "4.0.1"
[[deps.CUDA_Driver_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg"]
git-tree-sha1 = "75d7896d1ec079ef10d3aee8f3668c11354c03a1"
uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc"
version = "0.2.0+0"
[[deps.CUDA_Runtime_Discovery]]
deps = ["Libdl"]
git-tree-sha1 = "58dd8ec29f54f08c04b052d2c2fa6760b4f4b3a4"
uuid = "1af6417a-86b4-443c-805f-a4643ffb695f"
version = "0.1.1"
[[deps.CUDA_Runtime_jll]]
deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "d3e6ccd30f84936c1a3a53d622d85d7d3f9b9486"
uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
version = "0.2.3+2"
[[deps.CUDNN_jll]]
deps = ["Artifacts", "CUDA_Runtime_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "57011df4fce448828165e566af9befa2ea94350a"
uuid = "62b44479-cb7b-5706-934f-f13b2eb2e645"
version = "8.6.0+3"
[[deps.AbstractFFTs]]
deps = ["ChainRulesCore", "LinearAlgebra"]
git-tree-sha1 = "69f7020bd72f069c219b5e8c236c1fa90d2cb409"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "1.2.1"
[[deps.Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "0310e08cb19f5da31d08341c6120c047598f5b9c"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "3.5.0"
[[deps.BFloat16s]]
deps = ["LinearAlgebra", "Printf", "Random", "Test"]
git-tree-sha1 = "dbf84058d0a8cbbadee18d25cf606934b22d7c66"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.4.2"
[[deps.CEnum]]
git-tree-sha1 = "eb4cb44a499229b3b8426dcfb5dd85333951ff90"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.2"
[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"]
git-tree-sha1 = "4dfaff044eb2ce11a897fecd85538310e60b91e6"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "8.6.2"
[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "57f7cde02d7a53c9d1d28443b9f11ac5fbe7ebc9"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.1.3"
[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "95185985a5d2388c6d0fedb06181ad4ddd40e0cb"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.17.2"
[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "df115c31f5c163697eede495918d8e85045c8f04"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "4.16.0"
Version info
Details on Julia: Julia 1.8.2
julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
Threads: 4 on 16 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 4
Details on CUDA:
julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 527.56.0
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+527.56
Toolchain:
- Julia: 1.8.2
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
1 device:
0: NVIDIA GeForce GTX 1650 Ti (sm_75, 2.901 GiB / 4.000 GiB available)
Hmm, that's not enough information to help you.
I see EDITOR=code in there; are you executing this from the VS Code terminal? If so, see https://github.com/JuliaGPU/CUDA.jl/issues/875.
The same happens in a REPL outside VS Code. I think the problem may be in the Statistics implementation of the mean function, but it works fine on the CPU.
Example:
julia> A = CUDA.ones((640, 640, 32, 1))
julia> B = ones((640, 640, 32, 1))
julia> mean(A; dims=1)
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
;;; …
[:, :, 32, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
julia> mean(A; dims=[1,2])
1×1×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 1.0000081
;;; …
[:, :, 32, 1] =
 1.0000081
julia> mean(A; dims=[1,2,3])
1×1×1×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
1.0000155
julia> mean(B; dims=[1,2,3])
1×1×1×1 Array{Float64, 4}:
[:, :, 1, 1] =
1.0
julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
Threads: 1 on 16 virtual cores
I think I found what may be causing this: the Statistics mean function calls _mean, which GPUArrays overrides as function Statistics._mean(f, A::AbstractGPUArray, dims).
It seems that when the inverse of the size of the reduced dimensions (a Float64) is converted to Float32 (in the situation where you have a Float32 CuArray), small rounding errors are introduced that create the behavior I mentioned above.
I executed this _mean function statement by statement and found this:
julia> A = CUDA.ones((640, 640, 32, 1))
640×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
julia> T = float(eltype(A))
Float32
julia> λ = convert(T, inv(_mean_denom(A, dims)))
0.0015625f0
julia> sum(Base.Fix1(*,λ), A; dims)
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
;;; …
[:, :, 32, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
But with a smaller Array this does not happen.
julia> A = CUDA.ones((320, 320, 32, 1))
320×320×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
julia> T = float(eltype(A))
Float32
julia> λ = convert(T, inv(_mean_denom(A, dims)))
0.003125f0
julia> sum(Base.Fix1(*,λ), A; dims)
1×320×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 1.0  1.0  1.0  1.0  …  1.0  1.0  1.0
;;; …
[:, :, 32, 1] =
 1.0  1.0  1.0  1.0  …  1.0  1.0  1.0
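For comparison, the same scale-then-sum strategy can be imitated on the CPU (a sketch; Base's sum reduces in a different order than the GPU's tree reduction, so the exact drift differs, but the mechanism is the same rounding accumulation):

λ = Float32(inv(640))         # 1/640 rounded to Float32, as _mean computes it
sum(fill(λ, 640))             # 640 rounded additions: drifts just below 1.0f0
sum(fill(1.0f0, 640)) / 640   # sum first, divide once: exactly 1.0f0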
@mcabbott can you also check this?
I don't see this with the size above:
julia> size(a)
(640, 640, 32, 1)
julia> mean(a; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
1.0
1.0
(@v1.10) pkg> st CUDA GPUArrays
Status `~/.julia/environments/v1.10/Project.toml`
[052768ef] CUDA v4.0.1
[0c68f7d7] GPUArrays v8.6.2
julia> nextfloat(0.999994f0, 101)
1.0f0
What is the actual value you get? The compact printing may have lost some digits from 0.999994.
julia> mean(A; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.9999937
0.9999937
(@v1.8) pkg> st CUDA GPUArrays
Project YOLOv7 v0.1.0
Status `C:\Users\gabri\.julia\environments\v1.8\Project.toml`
[052768ef] CUDA v4.0.1
[0c68f7d7] GPUArrays v8.6.2
julia> nextfloat(0.9999937f0, 101)
0.9999997f0
I made the array larger just to check, and look:
julia> size(A)
(1280, 1280, 32, 1)
julia> mean(A; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.9999892
0.9999892
I noticed this in my Flux BatchNorm layers: they were producing some wrong values, and that's how I ended up finding this problem with larger arrays.
Another example with smaller values.
julia> B = CUDA.fill(1.0f-5, (640, 640, 32, 1))
640×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
julia> mean(B; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
1.0000034f-5
1.0000034f-5
Some accuracy experiments (on CPU):
julia> function countepsfrom(x::T, xtrue) where {T<:AbstractFloat}
           target = T(xtrue)
           # try n = 0, -1, 1, -2, 2, …: is target within ±100 ulps of x?
           for n in Iterators.flatten(zip(0:100, -1:-1:-100))
               nextfloat(x, n) === target && return n
           end
           return (target - x) / eps(x)  # further away: report distance in multiples of eps
       end;
julia> mean1(x, λ=convert(eltype(x), inv(length(x)))) = sum(Base.Fix1(*,λ), x); # like CUDA at present
julia> mean1(ones(Float32, 640))
0.9999992f0
julia> countepsfrom(ans, 1)
13
julia> mean2(x, l=length(x)) = sum(Base.Fix2(/,l), x); # divide instead, same result
julia> mean2(ones(Float32, 640))
0.9999992f0
julia> countepsfrom(ans, 1)
13
julia> mean3(x) = sum(x)/length(x); # naive sum then divide
julia> mean3(ones(Float32, 640))
1.0f0
Using Float16 to look for overflow:
julia> mean1(ones(Float16, 10^5))
Float16(1.002)
julia> countepsfrom(ans, 1)
-2
julia> mean2(ones(Float16, 10^5))
Float16(0.0)
julia> mean3(ones(Float16, 10^5))
NaN16
julia> x = rand(Float16, 10^5);
julia> xbar = mean(big.(x)); # true result
julia> countepsfrom(mean1(x), xbar)
-2
julia> countepsfrom(mean2(x), xbar)
Inf16
julia> countepsfrom(mean3(x), xbar)
Inf16
The solution I proposed in https://github.com/JuliaGPU/GPUArrays.jl/pull/453 will fail for Float16, but not for Float32.
julia> mean4(x, λ=convert(eltype(x), inv(length(x)))) = sum(x) .* λ;
julia> mean4(ones(Float32, 640))
1.0f0
julia> mean4(ones(Float16, 10^5))
Inf16
julia> x = rand(Float16, 10^5);
julia> xbar = mean(big.(x));
julia> countepsfrom(mean4(x), xbar)
-2
julia> mean1(ones(Float32, 10^9))
0.99999976f0
julia> countepsfrom(ans, 1)
4
julia> mean4(ones(Float32, 10^9))
1.0f0
I have never worked with Float16, so I can't imagine how to deal with it in these operations.
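For reference, one common remedy (not what the PR above does; mean5 is a hypothetical helper introduced here) is to accumulate in a wider type and convert back only at the end, which avoids both the Float16 overflow of the plain sum and the underflow from pre-dividing:

# hypothetical: widen Float16 to Float32 for the reduction, narrow the final result
mean5(x::AbstractArray{Float16}) = Float16(sum(Float32, x) / length(x))

mean5(ones(Float16, 10^5))    # Float16(1.0): the Float32 sum (100000) neither overflows nor drifts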
The reason this doesn't reproduce consistently is the big_mapreduce_threshold cutoff that mapreduce uses.
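A quick way to see that size dependence (a sketch; the exact threshold is an internal detail and can change between versions, so these sizes are arbitrary):

using CUDA
for n in (640, 4096, 65536, 2^20)
    λ = Float32(1/n)
    println(n, " => ", sum(CUDA.fill(λ, n)))  # ideally 1.0f0; the drift changes with the reduction strategy
end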
I noticed very strange behaviour with summation as well (mean behaves the same for me), where it starts producing completely wrong results:
julia> test_array = CuArray(rand(Float32, 50_000, 2, 16));
julia> results = map(_ -> sum(test_array, dims=1), 1:10)
10-element Vector{CuArray{Float32, 3, CUDA.DeviceMemory}}:
 [25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
 ⋮
 [25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
julia> results = map(_ -> sum(test_array, dims=1), 1:10)
10-element Vector{CuArray{Float32, 3, CUDA.DeviceMemory}}:
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
What is interesting is that dims=2 and dims=3 are entirely unaffected; I also cannot reproduce it if the tensor is smaller, e.g. (500, 2, 16).
That's unrelated; there was a bug with mapreduce that's been fixed on #master.