Bitonic mergesort
Implements Batcher's bitonic mergesort. The algorithm is really a sorting network, but it can also be run as an ordinary sequential sorting algorithm.
Based on the layout explained in this Wikipedia section:
https://en.wikipedia.org/wiki/Bitonic_sorter#Alternative_representation
it becomes simple to implement a version that works with inputs of arbitrary length. We just assume +Inf for all the complementary entries that would pad the array size to a power of two, and simply skip them in the comparisons.
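A minimal sketch of that skipping idea (illustrative only, not the code in this PR; the helper name compare_exchange! is made up): any compare-exchange whose upper index falls in the imaginary +Inf padding is a no-op, because the padding entry can never move down.

    # Hypothetical illustration of the virtual +Inf padding: indices past the end
    # of the real array are treated as +Inf, so the comparison is skipped entirely.
    function compare_exchange!(v, lo, hi)
        hi > lastindex(v) && return v   # partner is a virtual +Inf entry: nothing to do
        if v[hi] < v[lo]
            v[lo], v[hi] = v[hi], v[lo]
        end
        return v
    end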
The functions have been implemented forcing some variables into Val type parameters, so that loop vectorization is more likely to be triggered by the compiler. This seems hard to achieve otherwise, especially for the first step with its backwards iteration.
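Roughly, the Val trick looks like the following sketch (not the PR's actual code; stride_kernel! is a made-up name). Lifting the stride into a type parameter makes it a compile-time constant, which gives the compiler a much better shot at unrolling and vectorizing the loop.

    # Hypothetical sketch of the Val trick: S is a type parameter, so every
    # specialization of this method has a constant, statically known stride.
    function stride_kernel!(v, ::Val{S}) where S
        @inbounds for i in firstindex(v):lastindex(v)-S
            lo, hi = i, i + S
            if v[hi] < v[lo]
                v[lo], v[hi] = v[hi], v[lo]
            end
        end
        return v
    end

The caller wraps the runtime value once, e.g. stride_kernel!(v, Val(4)), paying one dynamic dispatch per call instead of per iteration.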
No experiments have been made yet regarding speed, but good vectorized code seems to be produced at least in some cases, and the tests are already passing, so I thought I'd go ahead and ask for feedback already.
This PR was inspired by the previous discussion with @LilithHafner in #54.
fwiw, here's a first draft of an implementation that uses bit-twiddling to have simpler control flow (i.e. fewer and longer inner loops):
using Base.Order: Ordering, Forward, Reverse, lt

function bitonic_merge_sort!(v::AbstractVector, o::Ordering=Forward)
    n = Int(length(v))
    fi = firstindex(v)
    # one merge stage per bit needed to represent n-1
    for k in 0:8sizeof(n-1)-leading_zeros(n-1)-1
        lo_mask = 1<<k-1
        # first step of each merge: mirrored comparisons within a block of size 2^(k+1)
        i = 0
        while true
            lo_bits = i & lo_mask
            hi_bits = (i & ~lo_mask) << 1
            hi = hi_bits + lo_bits + lo_mask + 1
            hi >= n && break
            lo = hi_bits + lo_bits ⊻ lo_mask
            @inbounds swap!(v, lo+fi, hi+fi, o)
            i += 1
        end
        # remaining steps: straight comparisons at stride 2^j, shrinking each pass
        for j in k-1:-1:0
            lo_mask = 1<<j-1
            i = 0
            while true
                lo = i & lo_mask + (i & ~lo_mask) << 1
                hi = lo + lo_mask + 1
                hi >= n && break
                @inbounds swap!(v, lo+fi, hi+fi, o)
                i += 1
            end
        end
    end
    v
end

Base.@propagate_inbounds function swap!(v, i, j, o)
    a, b = v[i], v[j]
    v[i], v[j] = lt(o, b, a) ? (b, a) : (a, b)
end
using OffsetArrays: OffsetVector

for _ in 1:1000
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10, rand(1:10)), rand(-10:10)))) || error()
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10, rand(1:10)), rand(-10:10)), Reverse), rev=true) || error()
end
julia> @btime sort!(x) setup=(x=rand(Int, 80)) evals=1;
1.543 μs (0 allocations: 0 bytes)
julia> @btime bitonic_merge_sort!(x) setup=(x=rand(Int, 80)) evals=1;
1.752 μs (0 allocations: 0 bytes)
julia> @btime sort!(x; alg=BitonicSort) setup=(x=rand(Int, 80)) evals=1;
73.077 μs (28 allocations: 1.31 KiB)
I have yet to find a case where it outperforms the default algorithm (though I happen to be running on a branch with pretty good default algorithms).
@LilithHafner that's great! I think I saw something similar in examples out there, but I didn't really understand until now... Have you checked if the loop vectorizations from the compiler are kicking in, though? How does it compare to the original code? Also I would expect that at least for very short lists it should perform great...
Have you checked if the loop vectorizations from the compiler are kicking in, though?
I don't see any simd instructions in @code_llvm bitonic_merge_sort!([1,2,3], Forward).
How does it compare to the original code?
About 50x faster than the PR code according to the benchmarks I posted above. Results are similar for longer vectors, but the default algorithms have asymptotic runtimes of O(n log n) and O(n), so bitonic merge sort with its O(n (log n)^2) runtime will probably only shine for smallish inputs. The benchmarks are incredibly unfair, though, because the PR still has some major performance issues that can probably be fairly easily resolved (https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/62#discussion_r993337169 & https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/62#discussion_r993076251).
Codecov Report
Merging #62 (d4c23d9) into master (80c14f5) will increase coverage by 0.33%. The diff coverage is 100.00%.

| Coverage Diff | master | #62 | +/- |
|---|---|---|---|
| Coverage | 96.51% | 96.85% | +0.33% |
| Files | 1 | 1 | |
| Lines | 344 | 381 | +37 |
| Hits | 332 | 369 | +37 |
| Misses | 12 | 12 | |

| Impacted Files | Coverage Δ |
|---|---|
| src/SortingAlgorithms.jl | 96.85% <100.00%> (+0.33%) |
I made some of the changes; hopefully I can look at your implementation in more detail later today.
I agree, but I'm not sure what to call it. "comparator" is the term used in the sorting networks literature, which is why I was considering it. Ideally it should be a verb, but "compare" also suggests a predicate. Maybe sort2!(a, b, o)?
On Thu, 13 Oct 2022, 09:48, Lilith Orion Hafner commented on this pull request:
In src/SortingAlgorithms.jl (https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/62#discussion_r994278953):

    a, b = v[ia + 1], v[ib + 1]
    v[ia+1], v[ib+1] = lt(o, b, a) ? (b, a) : (a, b)

A comparator is a function that takes two (or sometimes three or more) inputs and determines which is larger. lt(order, _, _) and isless are comparators. This function also conditionally swaps the elements. There may be better names for it than swap, but I don't like comparator.
https://en.wikipedia.org/wiki/Comparator https://clojure.org/guides/comparators https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html
I've tried some benchmarking, with nasty results, I guess similar to what you observed. I find it strange, though, that I couldn't get reasonable times even for small inputs. I think I'll try something like generated functions or "hand-crafted" SIMD implementations just to see if there's any potential and to find a clue about what's going on...
Here's an implementation in Julia: https://github.com/JuliaArrays/StaticArrays.jl/blob/v1.5.9/src/sort.jl#L50-L89
Nice! I don't think I had seen that before. I guess generated functions are really the way to go.
In the comments they say only inputs up to a certain size are accepted because of the sub-optimal asymptotic complexity, but I'd argue that because of parallelism it's hard to know for sure where that cutoff is, which is also supported by the good performance of combsort. Maybe we could keep working on a more generic version to see how it goes. We probably need generated functions to get good machine code, though.
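For reference, here is a minimal sketch of that generated-function idea (not the StaticArrays code; bitonic_net! and the Val(N) interface are made up). The generator emits the whole compare-exchange sequence of the classic bitonic network for a fixed power-of-two length N, so the compiled method is straight-line code with no loop or bookkeeping overhead.

    # Hypothetical sketch: unroll the classic bitonic network for a fixed
    # power-of-two length N at compile time. Assumes a 1-based vector with
    # length(v) >= N; network element i maps to v[i + 1].
    @generated function bitonic_net!(v::AbstractVector, ::Val{N}) where N
        @assert ispow2(N)
        body = Expr(:block)
        k = 2
        while k <= N                   # size of the bitonic blocks being merged
            j = k ÷ 2
            while j >= 1               # comparison stride within the merge
                for i in 0:N-1
                    l = i ⊻ j          # partner of element i at this stride
                    if l > i
                        # blocks with (i & k) == 0 sort ascending, the others descending
                        cmp = (i & k) == 0 ? :(v[$l + 1] < v[$i + 1]) : :(v[$i + 1] < v[$l + 1])
                        push!(body.args, quote
                            if $cmp
                                v[$i + 1], v[$l + 1] = v[$l + 1], v[$i + 1]
                            end
                        end)
                    end
                end
                j ÷= 2
            end
            k *= 2
        end
        push!(body.args, :(return v))
        return body
    end

    # Usage: bitonic_net!(rand(Int, 16), Val(16)) compiles a fully unrolled
    # 16-element network the first time it is called.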
I wonder if the literature already has some kind of modification to combsort that makes it look more like bitonic merge sort. The main difference I see, apart from the fact that the sub-list sizes are all powers of two, is that the intervals grow first, then shrink, then grow again. If we could prove that either of these properties puts a hard limit on the maximum distance of an entry from its final position, that would be it, I think.