Pull tmp out of the inner loop
I think the tmp variable should be moved out of the inner loop. The current implementation moves the selected element into place by repeated swapping, instead of shifting the larger elements up and writing the selected element once where it belongs. Is it even insertion sort right now?
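To make the difference concrete, here is a rough sketch of the two loop shapes. The function names and the `int` element type are my own assumptions for illustration, not the actual code in this repository.

```c
#include <stddef.h>

/* Hypothetical swap-based variant: tmp lives inside the inner loop and
 * the selected element is moved one slot at a time. */
void insertion_sort_swap(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        for (size_t j = i; j > 0 && a[j - 1] > a[j]; j--) {
            int tmp = a[j];
            a[j] = a[j - 1];
            a[j - 1] = tmp;      /* two array writes per step */
        }
    }
}

/* Hypothetical shift-based variant: tmp is hoisted out of the inner
 * loop, larger elements are shifted up, and the selected element is
 * written exactly once. */
void insertion_sort_shift(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int tmp = a[i];
        size_t j = i;
        for (; j > 0 && a[j - 1] > tmp; j--)
            a[j] = a[j - 1];     /* one array write per step */
        a[j] = tmp;              /* final position, written once */
    }
}
```

Both variants produce the same sorted output; only the number of writes per inner-loop step differs.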
This needs to be tested against the benchmarks.
The effect of this bug is that the function is slower than it can be. It does, however, still work.
I will try to fix it this evening and publish a new release.
My initial tests show that the code size grows with this change, so it is a tradeoff between speed and code size. The fix would roughly halve the sorting time (about half as many memory writes), but in a tradeoff like this I always prefer the smaller code size. Will take another look later.
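For reference, a small sketch that counts array writes for both loop shapes. The function names are hypothetical, and it only counts writes; it says nothing about code size or real timings on any particular target.

```c
#include <stddef.h>

/* Sorts a in place with the swap-based loop and returns the number of
 * array writes performed (hypothetical helper, for illustration only). */
size_t count_writes_swap(int *a, size_t n)
{
    size_t writes = 0;
    for (size_t i = 1; i < n; i++) {
        for (size_t j = i; j > 0 && a[j - 1] > a[j]; j--) {
            int tmp = a[j];
            a[j] = a[j - 1];
            a[j - 1] = tmp;
            writes += 2;         /* each swap writes two array slots */
        }
    }
    return writes;
}

/* Same, for the shift-based loop with tmp hoisted out of the inner loop. */
size_t count_writes_shift(int *a, size_t n)
{
    size_t writes = 0;
    for (size_t i = 1; i < n; i++) {
        int tmp = a[i];
        size_t j = i;
        for (; j > 0 && a[j - 1] > tmp; j--) {
            a[j] = a[j - 1];
            writes++;            /* each shift writes one array slot */
        }
        a[j] = tmp;
        writes++;                /* final placement of the element */
    }
    return writes;
}
```

On a reversed array of 8 elements this gives 56 writes for the swap loop and 35 for the shift loop. For reversed input of length n the ratio is 1/2 + 1/n, so it approaches one half as n grows.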