Inconsistent results w/ and w/o @turbo
Consider the following code:
using LoopVectorization, OffsetArrays
function morph_dilate(A::AbstractArray{T,N}, kernel::AbstractArray{S,N}) where {T<:Integer,S<:Integer,N}
    # output axes: the input axes extended by the kernel axes
    out = similar(A, tuple((first(axes(A, n)) + first(axes(kernel, n)):
                            last(axes(A, n)) + last(axes(kernel, n)) for n ∈ 1:N)...)...)
    out .= zero(T)
    Ks = CartesianIndices(kernel)
    @turbo for I in CartesianIndices(A)
        for K in Ks
            # "convolution" with & as product and | as sum
            out[I+K] |= A[I] & kernel[K]
        end
    end
    out
end
[The function essentially performs a convolution of the input array A with the given kernel, using bitwise AND (&) and OR (|) in place of the product and sum in the convolution expression. It also extends the axes of the original input array A by those of the kernel.]
When I use this function on a test input I obtain an unexpected result:
julia> morph_dilate([1 2 3; 4 5 6], OffsetArrays.centered([0 7 0; 7 7 7; 0 7 0]))
4×5 OffsetArray(::Matrix{Int64}, 0:3, 0:4) with eltype Int64 with indices 0:3×0:4:
 0  1  0  3  0
 1  5  3  7  3
 4  5  6  7  6
 0  4  0  6  0
Note that this result is wrong: the correct result, which can be obtained by removing the @turbo macro from the definition above, is
julia> morph_dilate([1 2 3; 4 5 6], OffsetArrays.centered([0 7 0; 7 7 7; 0 7 0]))
4×5 OffsetArray(::Matrix{Int64}, 0:3, 0:4) with eltype Int64 with indices 0:3×0:4:
 0  1  2  3  0
 1  7  7  7  3
 4  5  7  7  6
 0  4  5  6  0
Am I using @turbo in a wrong way, or is this a bug?
Bug/known limitation. It is fixed in LoopModels, the successor of LoopVectorization. LoopModels doesn't actually work yet, though, so it'll be a while before you can switch.
This blog post discusses how to fix it automatically (i.e., similarly to what LoopModels does), but you can also do it manually, as SimpleChains.jl does: https://github.com/PumasAI/SimpleChains.jl/blob/f028d69679d47f11d35e7f311abdf0d1d3bfab9c/src/conv.jl#L561-L581 The manual fix is awkward because LoopVectorization also only supports rectangular loop nests (another limitation LoopModels fixes), so the workaround for LoopVectorization has to calculate the extrema and then mask off the parts that are out of bounds.
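In outline, that rectangular-nest-plus-mask structure looks like the following 1D sketch (an illustration only, not the SimpleChains code; the function name is made up):

function dilate1d(a::AbstractVector{T}, kern::AbstractVector{T}) where {T<:Integer}
    lo, hi = firstindex(a), lastindex(a)
    out = zeros(T, length(a) + length(kern) - 1)
    for j in eachindex(out)
        tmp = zero(T)
        for k in eachindex(kern)          # rectangular nest: every (j, k) pair is visited
            i = j - k + 1                 # input index for this pair
            inb = (i >= lo) & (i <= hi)   # mask instead of shrinking the loop bounds
            tmp |= (inb ? a[i] : zero(T)) & kern[k]
        end
        out[j] = tmp
    end
    out
end

The point of the structure is that the inner loop always runs over the full kernel range, with out-of-bounds contributions neutralized by the mask, rather than the loop bounds themselves depending on the outer index.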
Any chance LoopModels will support Float16?
LoopModels is written as an LLVM pass, so it should more naturally support the types that Julia and LLVM support.
@chriselrod thank you very much for your prompt reply. I understand this is a known bug/limitation, but from a user's perspective I have not seen it mentioned anywhere in the package documentation. Before your kind message, I did not even know what tiling was (my ignorance, but I guess that not all users of LoopVectorization are experts in this topic). Only now do I realize the difficulty of rewriting this specific kind of loop operation.
In any case, since (correct me if I am wrong) there is no mention of this issue in the documentation, I feel that LoopVectorization should raise an error saying that this particular loop cannot be vectorized (as it does in other situations), rather than running without any apparent problem and producing incorrect results.
Well, actually, it could vectorize it in a less fancy way. So the check should probably just restrict it to the less fancy case.
"but I guess that not all users of LoopVectorization are experts in this topic"
Yeah, the goal of the repo (and LoopModels) is that users do not have to know anything.
OK, I am trying to fix the issue manually with this code:
function morph_dilate2(A::AbstractArray{T,N}, kernel::AbstractArray{S,N}) where {T<:Integer,S<:Integer,N}
    out = similar(A, tuple((first(axes(A, n)) + first(axes(kernel, n)):
                            last(axes(A, n)) + last(axes(kernel, n)) for n ∈ 1:N)...)...)
    Ks = CartesianIndices(kernel)
    Is = CartesianIndices(A)
    Ifirst, Ilast = first(Is), last(Is)
    Js = CartesianIndices(out)
    @turbo for J ∈ Js
        tmp = zero(T)
        for K ∈ Ks
            # mask off input indices that fall outside A
            Aᵢ = (Ifirst <= J - K <= Ilast) ? A[J-K] : zero(T)
            tmp |= Aᵢ & kernel[K]
        end
        out[J] = tmp
    end
    out
end
However, @turbo fails with the error message "Reduction not found." (which surprises me, since the reduction is actually there). The issue seems to be the use of CartesianIndex in the ? ... : operator, but if possible I would like to avoid writing the loops out dimension by dimension and rather keep working with CartesianIndex.
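For reference, this is roughly what the dimension-by-dimension fallback would look like in the 2D case. It is only a sketch: the function name is made up, and I have not verified that @turbo accepts this form (as plain Julia, without @turbo, it runs and matches the intent of morph_dilate2):

function morph_dilate2_2d(A::AbstractMatrix{T}, kernel::AbstractMatrix{S}) where {T<:Integer,S<:Integer}
    out = similar(A, first(axes(A, 1)) + first(axes(kernel, 1)):last(axes(A, 1)) + last(axes(kernel, 1)),
                     first(axes(A, 2)) + first(axes(kernel, 2)):last(axes(A, 2)) + last(axes(kernel, 2)))
    i1, i2 = firstindex(A, 1), lastindex(A, 1)
    j1, j2 = firstindex(A, 2), lastindex(A, 2)
    for J2 in axes(out, 2), J1 in axes(out, 1)   # candidate @turbo loop, once the form is accepted
        tmp = zero(T)
        for K2 in axes(kernel, 2), K1 in axes(kernel, 1)
            I1, I2 = J1 - K1, J2 - K2
            # bounds test written as a chain of integer comparisons combined with &,
            # i.e. the branch-free form that masking requires
            inb = (I1 >= i1) & (I1 <= i2) & (I2 >= j1) & (I2 <= j2)
            Aᵢ = inb ? A[I1, I2] : zero(T)
            tmp |= Aᵢ & kernel[K1, K2]
        end
        out[J1, J2] = tmp
    end
    out
end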