Weird behaviour of mean function in CuArray
Describe the bug
When using a CuArray, I sometimes get weird behavior from some Flux normalization layers. Trying to find the problem, I saw that this happens (at least) when using the Statistics mean function. If I execute it on the CPU it always gives the expected values, but on the GPU it sometimes shows this strange behavior.
To reproduce
A minimal working example (MWE) for this bug:
julia> using CUDA, Statistics
julia> a = CUDA.ones((640, 640, 32, 1))
julia> b = mean(a; dims=[1])
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
;;; …
[:, :, 32, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
julia> b = mean(a |> cpu; dims=[1])
1×640×32×1 Array{Float32, 4}:
[:, :, 1, 1] =
 1.0  1.0  1.0  1.0  …  1.0  1.0  1.0
;;; …
[:, :, 32, 1] =
 1.0  1.0  1.0  1.0  …  1.0  1.0  1.0
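For reference, a compact way to quantify the discrepancy without printing whole arrays (a sketch; b_gpu and b_cpu are names introduced here, and the expected gap of roughly 6.3f-6 is just 1 minus the 0.9999937 value reported further down in this thread):

b_gpu = Array(mean(a; dims=1))   # bring the GPU result back to the host
b_cpu = mean(Array(a); dims=1)   # same reduction on the CPU
maximum(abs.(b_gpu .- b_cpu))    # ≈ 6.3f-6 here, instead of the expected 0.0f0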
Manifest.toml
[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Preferences", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions"]
git-tree-sha1 = "edff14c60784c8f7191a62a23b15a421185bc8a8"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "4.0.1"
[[deps.CUDA_Driver_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg"]
git-tree-sha1 = "75d7896d1ec079ef10d3aee8f3668c11354c03a1"
uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc"
version = "0.2.0+0"
[[deps.CUDA_Runtime_Discovery]]
deps = ["Libdl"]
git-tree-sha1 = "58dd8ec29f54f08c04b052d2c2fa6760b4f4b3a4"
uuid = "1af6417a-86b4-443c-805f-a4643ffb695f"
version = "0.1.1"
[[deps.CUDA_Runtime_jll]]
deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "d3e6ccd30f84936c1a3a53d622d85d7d3f9b9486"
uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
version = "0.2.3+2"
[[deps.CUDNN_jll]]
deps = ["Artifacts", "CUDA_Runtime_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "57011df4fce448828165e566af9befa2ea94350a"
uuid = "62b44479-cb7b-5706-934f-f13b2eb2e645"
version = "8.6.0+3"
[[deps.AbstractFFTs]]
deps = ["ChainRulesCore", "LinearAlgebra"]
git-tree-sha1 = "69f7020bd72f069c219b5e8c236c1fa90d2cb409"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "1.2.1"
[[deps.Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "0310e08cb19f5da31d08341c6120c047598f5b9c"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "3.5.0"
[[deps.BFloat16s]]
deps = ["LinearAlgebra", "Printf", "Random", "Test"]
git-tree-sha1 = "dbf84058d0a8cbbadee18d25cf606934b22d7c66"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.4.2"
[[deps.CEnum]]
git-tree-sha1 = "eb4cb44a499229b3b8426dcfb5dd85333951ff90"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.2"
[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"]
git-tree-sha1 = "4dfaff044eb2ce11a897fecd85538310e60b91e6"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "8.6.2"
[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "57f7cde02d7a53c9d1d28443b9f11ac5fbe7ebc9"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.1.3"
[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "95185985a5d2388c6d0fedb06181ad4ddd40e0cb"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.17.2"
[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "df115c31f5c163697eede495918d8e85045c8f04"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "4.16.0"
Version info
Details on Julia: Julia 1.8.2
julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
Threads: 4 on 16 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 4
Details on CUDA:
julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 527.56.0
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+527.56
Toolchain:
- Julia: 1.8.2
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
1 device:
0: NVIDIA GeForce GTX 1650 Ti (sm_75, 2.901 GiB / 4.000 GiB available)
Hmm, that's not enough information to help you.
I see EDITOR=code in there; are you executing this from the VS Code terminal? If so, see https://github.com/JuliaGPU/CUDA.jl/issues/875.
The same happens in a REPL outside VS Code. I think the problem may be in the Statistics implementation of the mean function, but it works fine on the CPU.
Example:
julia> A = CUDA.ones((640, 640, 32, 1))
julia> B = ones((640, 640, 32, 1))
julia> mean(A; dims=1)
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
;;; …
[:, :, 32, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
julia> mean(A; dims=[1,2])
1×1×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 1.0000081
;;; …
[:, :, 32, 1] =
 1.0000081
julia> mean(A; dims=[1,2,3])
1×1×1×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
1.0000155
julia> mean(B; dims=[1,2,3])
1×1×1×1 Array{Float64, 4}:
[:, :, 1, 1] =
1.0
julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
Threads: 1 on 16 virtual cores
I think I found what may be causing this: the Statistics mean function calls _mean, which GPUArrays overrides as function Statistics._mean(f, A::AbstractGPUArray, dims).
It seems that when the inverse of the size of the reduced dimensions (a Float64) is converted to Float32 (in the situation where you have a Float32 CuArray), small rounding errors are introduced that create the behavior I mentioned above.
I executed this _mean function statement by statement and found this:
julia> A = CUDA.ones((640, 640, 32, 1))
640×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
julia> T = float(eltype(A))
Float32
julia> λ = convert(T, inv(_mean_denom(A, dims)))
0.0015625f0
julia> sum(Base.Fix1(*,λ), A; dims)
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
;;; …
[:, :, 32, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994
But with a smaller Array this does not happen.
julia> A = CUDA.ones((320, 320, 32, 1))
320×320×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
julia> T = float(eltype(A))
Float32
julia> λ = convert(T, inv(_mean_denom(A, dims)))
0.003125f0
julia> sum(Base.Fix1(*,λ), A; dims)
1×320×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 1.0  1.0  1.0  1.0  …  1.0  1.0  1.0
;;; …
[:, :, 32, 1] =
 1.0  1.0  1.0  1.0  …  1.0  1.0  1.0
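For comparison, the same scale-then-sum strategy can be imitated on the CPU (a sketch; Base's sum reduces in a different order than the GPU's tree reduction, so the exact drift differs, but the mechanism is the same rounding accumulation):

λ = Float32(inv(640))         # 1/640 rounded to Float32, as _mean computes it
sum(fill(λ, 640))             # 640 rounded additions: drifts just below 1.0f0
sum(fill(1.0f0, 640)) / 640   # sum first, divide once: exactly 1.0f0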
@mcabbott can you also check this?
I don't see this with the size above:
julia> size(a)
(640, 640, 32, 1)
julia> mean(a; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
1.0
1.0
(@v1.10) pkg> st CUDA GPUArrays
Status `~/.julia/environments/v1.10/Project.toml`
[052768ef] CUDA v4.0.1
[0c68f7d7] GPUArrays v8.6.2
julia> nextfloat(0.999994f0, 101)
1.0f0
What is the actual value you get? The compact printing may have lost some digits from 0.999994.
julia> mean(A; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.9999937
0.9999937
(@v1.8) pkg> st CUDA GPUArrays
Project YOLOv7 v0.1.0
Status `C:\Users\gabri\.julia\environments\v1.8\Project.toml`
[052768ef] CUDA v4.0.1
[0c68f7d7] GPUArrays v8.6.2
julia> nextfloat(0.9999937f0, 101)
0.9999997f0
I made the array larger just to check, and look:
julia> size(A)
(1280, 1280, 32, 1)
julia> mean(A; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.9999892
0.9999892
I noticed this in my Flux BatchNorm layers: they were producing some wrong values, and that's how I ended up finding this problem with larger arrays.
Another example with smaller values.
julia> B = CUDA.fill(1.0f-5, (640, 640, 32, 1))
640×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
julia> mean(B; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
1.0000034f-5
1.0000034f-5
Some accuracy experiments (on CPU):
julia> function countepsfrom(x::T, xtrue) where {T<:AbstractFloat}
           target = T(xtrue)
           # try n = 0, -1, 1, -2, 2, …: is target within ±100 ulps of x?
           for n in Iterators.flatten(zip(0:100, -1:-1:-100))
               nextfloat(x, n) === target && return n
           end
           return (target - x) / eps(x)  # further away: report distance in multiples of eps
       end;
julia> mean1(x, λ=convert(eltype(x), inv(length(x)))) = sum(Base.Fix1(*,λ), x); # like CUDA at present
julia> mean1(ones(Float32, 640))
0.9999992f0
julia> countepsfrom(ans, 1)
13
julia> mean2(x, l=length(x)) = sum(Base.Fix2(/,l), x); # divide instead, same result
julia> mean2(ones(Float32, 640))
0.9999992f0
julia> countepsfrom(ans, 1)
13
julia> mean3(x) = sum(x)/length(x); # naive sum then divide
julia> mean3(ones(Float32, 640))
1.0f0
Using Float16 to look for overflow:
julia> mean1(ones(Float16, 10^5))
Float16(1.002)
julia> countepsfrom(ans, 1)
-2
julia> mean2(ones(Float16, 10^5))
Float16(0.0)
julia> mean3(ones(Float16, 10^5))
NaN16
julia> x = rand(Float16, 10^5);
julia> xbar = mean(big.(x)); # true result
julia> countepsfrom(mean1(x), xbar)
-2
julia> countepsfrom(mean2(x), xbar)
Inf16
julia> countepsfrom(mean3(x), xbar)
Inf16
The solution I proposed in https://github.com/JuliaGPU/GPUArrays.jl/pull/453 will fail for Float16, but not for Float32.
julia> mean4(x, λ=convert(eltype(x), inv(length(x)))) = sum(x) .* λ;
julia> mean4(ones(Float32, 640))
1.0f0
julia> mean4(ones(Float16, 10^5))
Inf16
julia> x = rand(Float16, 10^5);
julia> xbar = mean(big.(x));
julia> countepsfrom(mean4(x), xbar)
-2
julia> mean1(ones(Float32, 10^9))
0.99999976f0
julia> countepsfrom(ans, 1)
4
julia> mean4(ones(Float32, 10^9))
1.0f0
I have never worked with Float16, so I can't imagine how to deal with it in these operations.
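For reference, one common remedy (not what the PR above does; mean5 is a hypothetical helper introduced here) is to accumulate in a wider type and convert back only at the end, which avoids both the Float16 overflow of the plain sum and the underflow from pre-dividing:

# hypothetical: widen Float16 to Float32 for the reduction, narrow the final result
mean5(x::AbstractArray{Float16}) = Float16(sum(Float32, x) / length(x))

mean5(ones(Float16, 10^5))    # Float16(1.0): the Float32 sum (100000) neither overflows nor drifts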
The reason this doesn't reproduce consistently is the big_mapreduce_threshold cutoff that mapreduce uses.
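A quick way to see that size dependence (a sketch; the exact threshold is an internal detail and can change between versions, so these sizes are arbitrary):

using CUDA
for n in (640, 4096, 65536, 2^20)
    λ = Float32(1/n)
    println(n, " => ", sum(CUDA.fill(λ, n)))  # ideally 1.0f0; the drift changes with the reduction strategy
end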
I noticed very strange behaviour with summation as well (mean behaves the same for me), where it starts producing completely wrong results:
julia> test_array = CuArray(rand(Float32, 50_000, 2, 16));
julia> results = map(_ -> sum(test_array, dims=1), 1:10)
10-element Vector{CuArray{Float32, 3, CUDA.DeviceMemory}}:
 [25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
 ⋮
 [25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
julia> results = map(_ -> sum(test_array, dims=1), 1:10)
10-element Vector{CuArray{Float32, 3, CUDA.DeviceMemory}}:
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[75032.81 74824.22;;; 74833.0 74942.375;;; 74767.8 74971.5;;; … ;;; 75398.83 75095.41;;; 75081.91 75143.46;;; 74982.7 75084.875]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
[25010.938 24941.406;;; 24944.332 24980.791;;; 24922.6 24990.5;;; … ;;; 25132.943 25031.8;;; 25027.303 25047.82;;; 24994.234 25028.293]
What is interesting is that dims=2 and dims=3 are entirely unaffected; I also cannot reproduce it if the tensor is smaller, e.g. (500, 2, 16).
That's unrelated; there was a bug with mapreduce that's been fixed on #master.