Faster sorting (run formation)
In the merge sorter, we need to overlap internal sorting (highly CPU bound) and writing the sorted runs (highly I/O bound).
We should split the available internal memory in two, using one half for accumulating a sorted run and the other half as an I/O buffer for the sorted elements. Sorting and I/O should be overlapped explicitly (e.g. with two separate threads) to run both in parallel.
An exploratory implementation of the merge sorter has been implemented in the parallel-merge-phase branch. A progress document has been created for this issue in order to keep track of test results. https://gist.github.com/svendcsvendsen/11253182
This branch by Svend seems to have been removed and the gist is also dead. Is this still a desired feature or did it turn out not to improve performance as much as one hoped?