More measures of presortedness
#48 added a few measures of presortedness into the library, and there still hasn't been a problem having them around. The original measures of presortedness were the eight ones described in Right invariant metrics and measures of presortedness, but there are actually more of them in the litterature. We could complete the module with the measures of presortedness described in the following papers:
- A framework for adaptive sorting by Petersson and Moffat
- A New Measure of Presortedness by Estivill-Castro and Wood
- Computing and ranking measures of presortedness by Chen
- Encroaching lists as a measure of presortedness by Skiena
- Historical searching and sorting by Moffat and Petersson
- Local insertion sort revisited by Katajainen, Levcopoulos and Petersson
- On partitions and presortedness of sequence by Carlsson and Chen
- Roughly Sorting: Sequential and Parallel Approach by Altman and Igarashi
- Sort Race by Zhang, Meng and Liang
- Sorting Shuffled Monotone Sequences by Levcopoulos and Petersson
The following to-be-implemented measures of presortedness can be found in the aforementioned articles:
- [x] Block(X): number of elements in a sequence that aren't followed by the same element in the sorted sequence
- [ ] Dist(X) and logDist(X)
- [x] DS(X)
- [x] Enc(X): number of encroaching lists extracted from X minus one
- [ ] Hist(X): something to do with historical insertion sort
- [x] ~~Lads(X): minimum size of monotonic subsequence covering of a sequence X~~ (computing it is NP-complete)
- [ ] Loc(X): something to do with local insertion sort
- [x] Mono(X): number of ascending and descending runs minus 1
- [x] ~~Par(X) = min { p | X is p-sorted }~~ (same as Dis(X))
- [x] ~~Radius(X) = min { k | X is k-sorted }~~ (same as Par(X))
- [ ] Reg(X): apparently a generalization of most of the other measures
- [ ] SMS(X): minimum number of non-decreasing and non-increasing subsequences (of possibly not adjacent elements) into which X can be partitioned
- [ ] Spart(T, X): minimal size of a specific type T of partitions on a sequence X
- [x] SUS(X): minimum number of non-decreasing subsequences (of possibly not adjacent elements) into which X can be partitioned
Comparing the definitions, I am pretty sure that Par(X) and Radius(X) are the same measures of presortedness, which means that k-sorted and p-sorted are actually the same thing. Roughly Sorting: Sequential and Parallel Approach even mentions A New Measure of Presortedness for the definition of k-sorted, even though the term k-sorted appears nowhere in the linked paper, which is all about p-sorted sequences.
A New Measure of Presortedness defines a p-sorted sequence as follows:
X is p-sorted iff for all i, j ∈ {1, 2, ..., |X|}, i - j > p implies Xj ≤ Xi
Roughly Sorting: Sequential and Parallel Approach defines a k-sorted sequence as follows:
for all i, j, 1 ≤ i ≤ j ≤ n, i ≤ j - k implies ai ≤ aj
If we substitute the symbols in the second definition with those of the first, we obtain the following definition instead (note that the i and j are inverted in both definitions):
for all i, j, 1 ≤ j ≤ i ≤ |X|, j ≤ i - k implies Xj ≤ Xi
The inequality j ≤ i - k is equivalent to i - j ≥ k, which makes it almost equivalent to the definition of a p-sorted value, save for those two differences:
- The definition of k-sorted uses ≥ while that of p-sorted uses >.
- The definition of k-sorted adds the 1 ≤ j ≤ i ≤ |X| restriction.
Without the additional restriction, i - j ≥ k will imply a negative value for j > i. Since k is used as the result of the measure of presortedness, it can't be a negative value by definition, so I believe that this additional restriction on the values of i and j is implicit in the definition of a p-sorted sequence.
The only difference left between the definition of the two measures is the strictness of the inequality in i - j ≥ k vs. i - j > p. However the paper An Optimal Sorting Algorithm for Presorted Sequences by H. Hamamura, J. Miyao and S. Wakabayashi defines Radius(X) as follows:
min { k | ∀i, j (1 ≤ i ≤ n, i < j - k) xi ≤ xj }
This definition corresponds to the one in Roughly Sorting: Sequential and Parallel Approach save for the inequality. Interestingly Roughly Sorting is still cited as the source of the definition of a k-sorted sequence, and yet its definition differs a bit and matches that of a p-sorted sequence.
My conclusion is that a p-sorted sequence and a k-sorted sequence are the same thing, which means that Par(X) and Radius(X) are the same measure of presortedness, and since we already have probe::par, we don't need probe::radius.
However, there comes the next interesting bit of trivia: Roughly Sorting [...] gives a method to compute Radius(X) in O(n) time on bidirectional iterators, while our current implementation of Par(X) runs in O(n² log n) and is restricted to random-access iterators. Despite those big improvements, said method runs in O(n) memory (two arrays of iterators + one of sizes) while outs runs in O(1) memory. It's a diagnostic tool, so the increased memory use surely matters less than a terrible complexity. However, that's out of scope for this issue, and more of in the realm of #85.
Commit 2cf80ba9dce5bab90679b17a3947519338ebe425 adds Block(X) to the library.