Bitonic mergesort
Implements Batcher's bitonic mergesort. The algorithm is really a sorting network, but it can also be run as an ordinary sequential sorting algorithm.
Based on the layout explained in this Wikipedia section:
https://en.wikipedia.org/wiki/Bitonic_sorter#Alternative_representation
it becomes simple to implement a version that works with inputs of arbitrary length. We just assume +Inf for all the complementary entries that would pad the array size to a power of two, and simply skip them in the comparisons.
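A minimal sketch of that skipping idea (illustrative only, not the code in this PR; the helper name compare_exchange! is made up): any compare-exchange whose upper index falls in the imaginary +Inf padding is a no-op, because the padding entry can never move down.

    # Hypothetical illustration of the virtual +Inf padding: indices past the end
    # of the real array are treated as +Inf, so the comparison is skipped entirely.
    function compare_exchange!(v, lo, hi)
        hi > lastindex(v) && return v   # partner is a virtual +Inf entry: nothing to do
        if v[hi] < v[lo]
            v[lo], v[hi] = v[hi], v[lo]
        end
        return v
    end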
The functions have been implemented forcing some variables into Val type parameters, so that loop vectorization is more likely to be triggered by the compiler. This seems hard to achieve otherwise, especially for the first step with its backwards iteration.
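Roughly, the Val trick looks like the following sketch (not the PR's actual code; stride_kernel! is a made-up name). Lifting the stride into a type parameter makes it a compile-time constant, which gives the compiler a much better shot at unrolling and vectorizing the loop.

    # Hypothetical sketch of the Val trick: S is a type parameter, so every
    # specialization of this method has a constant, statically known stride.
    function stride_kernel!(v, ::Val{S}) where S
        @inbounds for i in firstindex(v):lastindex(v)-S
            lo, hi = i, i + S
            if v[hi] < v[lo]
                v[lo], v[hi] = v[hi], v[lo]
            end
        end
        return v
    end

The caller wraps the runtime value once, e.g. stride_kernel!(v, Val(4)), paying one dynamic dispatch per call instead of per iteration.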
No experiments have been made yet regarding speed, but good vectorized code seems to be produced at least in some cases, and the tests are already passing, so I thought I'd go ahead and ask for feedback already.
This PR was inspired by the previous discussion with @LilithHafner in #54.
fwiw, here's a first draft of an implementation that uses bit-twiddling to have simpler control flow (i.e. fewer and longer inner loops):
using Base.Order: Ordering, Forward, Reverse, lt

function bitonic_merge_sort!(v::AbstractVector, o::Ordering=Forward)
    n = Int(length(v))
    fi = firstindex(v)
    # one merge stage per bit needed to represent n-1
    for k in 0:8sizeof(n-1)-leading_zeros(n-1)-1
        lo_mask = 1<<k-1
        # first step of each merge: mirrored comparisons within a block of size 2^(k+1)
        i = 0
        while true
            lo_bits = i & lo_mask
            hi_bits = (i & ~lo_mask) << 1
            hi = hi_bits + lo_bits + lo_mask + 1
            hi >= n && break
            lo = hi_bits + lo_bits ⊻ lo_mask
            @inbounds swap!(v, lo+fi, hi+fi, o)
            i += 1
        end
        # remaining steps: straight comparisons at stride 2^j, shrinking each pass
        for j in k-1:-1:0
            lo_mask = 1<<j-1
            i = 0
            while true
                lo = i & lo_mask + (i & ~lo_mask) << 1
                hi = lo + lo_mask + 1
                hi >= n && break
                @inbounds swap!(v, lo+fi, hi+fi, o)
                i += 1
            end
        end
    end
    v
end

Base.@propagate_inbounds function swap!(v, i, j, o)
    a, b = v[i], v[j]
    v[i], v[j] = lt(o, b, a) ? (b, a) : (a, b)
end
using OffsetArrays: OffsetVector

for _ in 1:1000
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10, rand(1:10)), rand(-10:10)))) || error()
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10, rand(1:10)), rand(-10:10)), Reverse), rev=true) || error()
end
julia> @btime sort!(x) setup=(x=rand(Int, 80)) evals=1;
1.543 μs (0 allocations: 0 bytes)
julia> @btime bitonic_merge_sort!(x) setup=(x=rand(Int, 80)) evals=1;
1.752 μs (0 allocations: 0 bytes)
julia> @btime sort!(x; alg=BitonicSort) setup=(x=rand(Int, 80)) evals=1;
73.077 μs (28 allocations: 1.31 KiB)
I have yet to find a case where it outperforms the default algorithm (though I happen to be running on a branch with pretty good default algorithms).
@LilithHafner that's great! I think I saw something similar in examples out there, but I didn't really understand until now... Have you checked if the loop vectorizations from the compiler are kicking in, though? How does it compare to the original code? Also I would expect that at least for very short lists it should perform great...
Have you checked if the loop vectorizations from the compiler are kicking in, though?
I don't see any simd instructions in @code_llvm bitonic_merge_sort!([1,2,3], Forward).
How does it compare to the original code?
About 50x faster than the PR code according to the benchmarks I posted above. Results are similar for longer vectors, but the default algorithms have asymptotic runtimes of O(n log n) and O(n), so bitonic merge sort with its O(n (log n)^2) runtime will probably only shine for smallish inputs. The benchmarks are incredibly unfair, though, because the PR still has some major performance issues that can probably be fairly easily resolved (https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/62#discussion_r993337169 & https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/62#discussion_r993076251).
Codecov Report
Merging #62 (d4c23d9) into master (80c14f5) will increase coverage by 0.33%. The diff coverage is 100.00%.

| Coverage Diff | master | #62 | +/- |
|---|---|---|---|
| Coverage | 96.51% | 96.85% | +0.33% |
| Files | 1 | 1 | |
| Lines | 344 | 381 | +37 |
| Hits | 332 | 369 | +37 |
| Misses | 12 | 12 | |

| Impacted Files | Coverage Δ |
|---|---|
| src/SortingAlgorithms.jl | 96.85% <100.00%> (+0.33%) |
I made some of the changes; hopefully I can look at your implementation in more detail later today.
I agree, but I'm not sure what to call it. "comparator" is the term used in the sorting networks literature, which is why I was considering it. Ideally it should be a verb, but "compare" also suggests a predicate. Maybe sort2!(a, b, o)?
On Thu, 13 Oct 2022, 09:48, Lilith Orion Hafner commented on this pull request:
In src/SortingAlgorithms.jl (https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/62#discussion_r994278953):

    a, b = v[ia + 1], v[ib + 1]
    v[ia+1], v[ib+1] = lt(o, b, a) ? (b, a) : (a, b)

A comparator is a function that takes two (or sometimes three or more) inputs and determines which is larger. lt(order, _, _) and isless are comparators. This function also conditionally swaps the elements. There may be better names for it than swap, but I don't like comparator.
https://en.wikipedia.org/wiki/Comparator https://clojure.org/guides/comparators https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html
I've tried some benchmarking, with nasty results, I guess similar to what you observed. I find it strange, though, that I couldn't get reasonable times even for small inputs. I think I'll try something like generated functions or "hand-crafted" SIMD implementations just to see if there's any potential and to find a clue about what's going on...
Here's an implementation in Julia: https://github.com/JuliaArrays/StaticArrays.jl/blob/v1.5.9/src/sort.jl#L50-L89
Nice! I don't think I had seen that before. I guess generated functions are really the way to go.
In the comments they say only inputs up to a certain size are accepted because of the sub-optimal asymptotic complexity, but I'd argue that because of parallelism it's hard to know for sure where that cutoff is, which is also supported by the good performance of combsort. Maybe we could keep working on a more generic version to see how it goes. We probably need generated functions to get good machine code, though.
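For reference, here is a minimal sketch of that generated-function idea (not the StaticArrays code; bitonic_net! and the Val(N) interface are made up). The generator emits the whole compare-exchange sequence of the classic bitonic network for a fixed power-of-two length N, so the compiled method is straight-line code with no loop or bookkeeping overhead.

    # Hypothetical sketch: unroll the classic bitonic network for a fixed
    # power-of-two length N at compile time. Assumes a 1-based vector with
    # length(v) >= N; network element i maps to v[i + 1].
    @generated function bitonic_net!(v::AbstractVector, ::Val{N}) where N
        @assert ispow2(N)
        body = Expr(:block)
        k = 2
        while k <= N                   # size of the bitonic blocks being merged
            j = k ÷ 2
            while j >= 1               # comparison stride within the merge
                for i in 0:N-1
                    l = i ⊻ j          # partner of element i at this stride
                    if l > i
                        # blocks with (i & k) == 0 sort ascending, the others descending
                        cmp = (i & k) == 0 ? :(v[$l + 1] < v[$i + 1]) : :(v[$i + 1] < v[$l + 1])
                        push!(body.args, quote
                            if $cmp
                                v[$i + 1], v[$l + 1] = v[$l + 1], v[$i + 1]
                            end
                        end)
                    end
                end
                j ÷= 2
            end
            k *= 2
        end
        push!(body.args, :(return v))
        return body
    end

    # Usage: bitonic_net!(rand(Int, 16), Val(16)) compiles a fully unrolled
    # 16-element network the first time it is called.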
I wonder if the literature already has some kind of modification to combsort that makes it look more like bitonic merge sort. The main difference I see, apart from the fact that the sub-list sizes are all powers of two, is that the intervals grow first, then shrink, then grow again. If we could prove that either of these properties puts a hard limit on the maximum distance of an entry from its final position, that would be it, I think.