Inconsistent results w/ and w/o @turbo
Consider the following code:
using LoopVectorization, OffsetArrays
function morph_dilate(A::AbstractArray{T,N}, kernel::AbstractArray{S,N}) where {T<:Integer,S<:Integer,N}
    # output axes: the input axes extended by the kernel axes
    out = similar(A, tuple((first(axes(A, n)) + first(axes(kernel, n)):
                            last(axes(A, n)) + last(axes(kernel, n)) for n ∈ 1:N)...)...)
    out .= zero(T)
    Ks = CartesianIndices(kernel)
    @turbo for I in CartesianIndices(A)
        for K in Ks
            # "convolution" with & as product and | as sum
            out[I+K] |= A[I] & kernel[K]
        end
    end
    out
end
[The function essentially performs a convolution of the input array A with the given kernel, using bitwise AND (&) and OR (|) in place of the product and sum in the convolution expression. It also extends the axes of the original input array A by those of the kernel.]
When I use this function on a test input I obtain an unexpected result:
julia> morph_dilate([1 2 3; 4 5 6], OffsetArrays.centered([0 7 0; 7 7 7; 0 7 0]))
4×5 OffsetArray(::Matrix{Int64}, 0:3, 0:4) with eltype Int64 with indices 0:3×0:4:
 0  1  0  3  0
 1  5  3  7  3
 4  5  6  7  6
 0  4  0  6  0
Note that this result is wrong: the correct result, which can be obtained by removing the @turbo macro from the definition above, is
julia> morph_dilate([1 2 3; 4 5 6], OffsetArrays.centered([0 7 0; 7 7 7; 0 7 0]))
4×5 OffsetArray(::Matrix{Int64}, 0:3, 0:4) with eltype Int64 with indices 0:3×0:4:
 0  1  2  3  0
 1  7  7  7  3
 4  5  7  7  6
 0  4  5  6  0
Am I using @turbo in a wrong way, or is this a bug?
Bug/known limitation. It is fixed in LoopModels, the successor of LoopVectorization. LoopModels doesn't actually work yet, though, so it'll be a while before you can switch.
This blog post discusses how to fix it automatically (i.e., similarly to what LoopModels does), but you can also do it manually, as SimpleChains.jl does: https://github.com/PumasAI/SimpleChains.jl/blob/f028d69679d47f11d35e7f311abdf0d1d3bfab9c/src/conv.jl#L561-L581 The manual fix is awkward because LoopVectorization also only supports rectangular loop nests (another limitation LoopModels fixes), so the workaround for LoopVectorization has to calculate the extrema and then mask off the parts that are out of bounds.
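In outline, that rectangular-nest-plus-mask structure looks like the following 1D sketch (an illustration only, not the SimpleChains code; the function name is made up):

function dilate1d(a::AbstractVector{T}, kern::AbstractVector{T}) where {T<:Integer}
    lo, hi = firstindex(a), lastindex(a)
    out = zeros(T, length(a) + length(kern) - 1)
    for j in eachindex(out)
        tmp = zero(T)
        for k in eachindex(kern)          # rectangular nest: every (j, k) pair is visited
            i = j - k + 1                 # input index for this pair
            inb = (i >= lo) & (i <= hi)   # mask instead of shrinking the loop bounds
            tmp |= (inb ? a[i] : zero(T)) & kern[k]
        end
        out[j] = tmp
    end
    out
end

The point of the structure is that the inner loop always runs over the full kernel range, with out-of-bounds contributions neutralized by the mask, rather than the loop bounds themselves depending on the outer index.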
Any chance LoopModels will support Float16?
LoopModels is written as an LLVM pass, so it should more naturally support the types that Julia and LLVM support.
@chriselrod thank you very much for your prompt reply. I understand this is a known bug/limitation, but from a user's perspective I have not seen it mentioned anywhere in the package documentation. Before your kind message, I did not even know what tiling was (my ignorance, but I guess that not all users of LoopVectorization are experts in this topic). Only now do I realize the difficulty of rewriting this specific kind of loop operation.
In any case, since (correct me if I am wrong) there is no mention of this issue in the documentation, I feel that LoopVectorization should raise an error saying that this particular loop cannot be vectorized (as it does in other situations), rather than running without any apparent problem and producing incorrect results.
Well, actually, it could vectorize it in a less fancy way. So the check should probably just restrict it to the less fancy case.
"but I guess that not all users of LoopVectorization are experts in this topic"
Yeah, the goal of the repo (and LoopModels) is that users do not have to know anything.
OK, I am trying to fix the issue manually with this code:
function morph_dilate2(A::AbstractArray{T,N}, kernel::AbstractArray{S,N}) where {T<:Integer,S<:Integer,N}
    out = similar(A, tuple((first(axes(A, n)) + first(axes(kernel, n)):
                            last(axes(A, n)) + last(axes(kernel, n)) for n ∈ 1:N)...)...)
    Ks = CartesianIndices(kernel)
    Is = CartesianIndices(A)
    Ifirst, Ilast = first(Is), last(Is)
    Js = CartesianIndices(out)
    @turbo for J ∈ Js
        tmp = zero(T)
        for K ∈ Ks
            # mask off input indices that fall outside A
            Aᵢ = (Ifirst <= J - K <= Ilast) ? A[J-K] : zero(T)
            tmp |= Aᵢ & kernel[K]
        end
        out[J] = tmp
    end
    out
end
However, @turbo fails with the error message "Reduction not found." (which surprises me, since the reduction is actually there). The issue seems to be the use of CartesianIndex in the ? ... : operator, but if possible I would like to avoid writing the loops out dimension by dimension and rather keep working with CartesianIndex.
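For reference, this is roughly what the dimension-by-dimension fallback would look like in the 2D case. It is only a sketch: the function name is made up, and I have not verified that @turbo accepts this form (as plain Julia, without @turbo, it runs and matches the intent of morph_dilate2):

function morph_dilate2_2d(A::AbstractMatrix{T}, kernel::AbstractMatrix{S}) where {T<:Integer,S<:Integer}
    out = similar(A, first(axes(A, 1)) + first(axes(kernel, 1)):last(axes(A, 1)) + last(axes(kernel, 1)),
                     first(axes(A, 2)) + first(axes(kernel, 2)):last(axes(A, 2)) + last(axes(kernel, 2)))
    i1, i2 = firstindex(A, 1), lastindex(A, 1)
    j1, j2 = firstindex(A, 2), lastindex(A, 2)
    for J2 in axes(out, 2), J1 in axes(out, 1)   # candidate @turbo loop, once the form is accepted
        tmp = zero(T)
        for K2 in axes(kernel, 2), K1 in axes(kernel, 1)
            I1, I2 = J1 - K1, J2 - K2
            # bounds test written as a chain of integer comparisons combined with &,
            # i.e. the branch-free form that masking requires
            inb = (I1 >= i1) & (I1 <= i2) & (I2 >= j1) & (I2 <= j2)
            Aᵢ = inb ? A[I1, I2] : zero(T)
            tmp |= Aᵢ & kernel[K1, K2]
        end
        out[J1, J2] = tmp
    end
    out
end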