Improve quantile performance v2
This is an alternative implementation to #86.
Following https://github.com/JuliaStats/Statistics.jl/pull/86#discussion_r704274221, here I perform the partial sorting incrementally.
I make this a separate PR to allow an easy comparison of the two approaches. Either one or the other should be merged.
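To make the idea concrete, here is a minimal sketch of the incremental scheme (illustrative only, not the exact PR code; the function name and the linear-interpolation rule, matching the default quantile definition, are my assumptions): the probabilities are visited in increasing order, and each step partially sorts only the still-unsorted tail of the data up to the largest index that quantile needs, so later quantiles reuse the ordering established for earlier ones.

```julia
# Illustrative sketch of incremental partial sorting (hypothetical name,
# not the PR's actual code). Assumes all p[i] are in [0, 1].
function incremental_quantiles(v::AbstractVector, p::AbstractVector{<:Real})
    vs = copy(v)
    n = length(vs)
    out = similar(p, Float64)
    lo = 1                           # vs[1:lo-1] is already in final sorted position
    for i in sortperm(p)             # visit the probabilities in increasing order
        h = (n - 1) * p[i] + 1       # fractional index of the quantile (default definition)
        fl = floor(Int, h)
        hi = min(fl + 1, n)          # we need vs[1:hi] in sorted order
        if hi >= lo
            # only the still-unsorted tail vs[lo:n] needs work; after this call
            # vs[lo:hi] holds the order statistics lo..hi of the whole vector
            partialsort!(view(vs, lo:n), 1:(hi - lo + 1))
            lo = hi + 1
        end
        out[i] = vs[fl] + (h - fl) * (vs[hi] - vs[fl])  # linear interpolation
    end
    return out
end
```

For example, `incremental_quantiles(rand(10^6), [0.1, 0.5, 0.9])` performs three nested partial sorts instead of one full sort.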
Codecov Report
Merging #91 (fb4c8d8) into master (74897fe) will increase coverage by 0.02%. The diff coverage is 100.00%.
@@            Coverage Diff             @@
##           master      #91      +/-   ##
==========================================
+ Coverage   96.89%   96.91%   +0.02%
==========================================
  Files           1        1
  Lines         419      422       +3
==========================================
+ Hits          406      409       +3
  Misses         13       13
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Statistics.jl | 96.91% <100.00%> (+0.02%) | :arrow_up: |
The only drawback of this approach is the case when very many quantiles are requested: we sort p and then perform many partial sorts. Maybe we should use some threshold on the length of p and fall back to the old algorithm when it is long?
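For illustration, such a fallback could be a simple branch like the one below (the names and the cutoff value are hypothetical, and the right threshold would itself need benchmarking; `incremental_quantiles` refers to the sketch above):

```julia
using Statistics

# Hypothetical dispatch between the two strategies; the cutoff is a placeholder.
function quantiles_with_fallback!(v::AbstractVector, p::AbstractVector{<:Real};
                                  cutoff::Integer = 64)
    if length(p) > cutoff
        sort!(v)                                        # old approach: one full sort
        return [quantile(v, pi; sorted=true) for pi in p]
    else
        return incremental_quantiles(v, p)              # new approach: nested partial sorts
    end
end
```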
Thanks. Is there any reason to think that a series of partial sorts of nested subsets of the data would be significantly slower than a single full sort? Have you tried benchmarking this?
> Is there any reason to think that a series of partial sorts of nested subsets of the data would be significantly slower than a single full sort?
We have to sort p, so if its length is comparable to the length of the vector we are analyzing, we essentially have to perform the sorting twice (sorting p alone is O(m log m), on top of the partial sorts of the data). I will try to benchmark this and post the results.
Here are the benchmarks:
julia> function f1(n)
           x = rand(10^6); x2 = copy(x)  # identical copies so each method sees unsorted data
           p = rand(n)
           @time Statistics._quantilesort!(x, false, extrema(p)...)  # current approach: one partial sort spanning extrema(p)
           @time new_quantilesort!(x2, false, p)                     # this PR: incremental partial sorts
           nothing
       end
f1 (generic function with 1 method)
julia> f1(5)
0.104513 seconds
0.018321 seconds (1 allocation: 128 bytes)
julia> f1(5)
0.075638 seconds
0.015375 seconds (1 allocation: 128 bytes)
julia> f1(5)
0.100432 seconds
0.018790 seconds (1 allocation: 128 bytes)
julia> f1(50)
0.099497 seconds
0.130415 seconds (1 allocation: 496 bytes)
julia> f1(50)
0.101801 seconds
0.106628 seconds (1 allocation: 496 bytes)
julia> f1(50)
0.109406 seconds
0.105568 seconds (1 allocation: 496 bytes)
julia> f1(500)
0.112124 seconds
1.050222 seconds (1 allocation: 4.062 KiB)
julia> f1(500)
0.106582 seconds
1.027544 seconds (1 allocation: 4.062 KiB)
So as you can see, the incremental approach starts to deteriorate much faster as the number of requested quantiles grows.
(As usual, it would not hurt if you double-checked this when you have time, as I might have made some error here.)
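For a more careful check, one could use BenchmarkTools with a fresh copy of the data per evaluation, so the second measurement is not affected by the first call having already partially sorted the vector (a sketch; `new_quantilesort!` stands for the routine added in this PR):

```julia
using BenchmarkTools, Statistics

n = 5  # number of requested quantiles
@btime Statistics._quantilesort!(x, false, extrema(p)...) setup=(x = rand(10^6); p = rand($n)) evals=1
@btime new_quantilesort!(x, false, p) setup=(x = rand(10^6); p = rand($n)) evals=1
```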
I'm not sure it's worth worrying about performance when the number of quantiles is large compared to the data. Quantiles don't make a lot of sense in that case.
Maybe a simple optimization is to do `sort_p = issorted(p) ? p : sort(p)` to avoid making a copy when possible (as p will already be sorted almost all the time).
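For illustration, `issorted` is a cheap O(m) scan, so the check costs essentially nothing relative to the copy-and-sort it avoids (hypothetical helper name):

```julia
# Return p itself when already sorted (the common case); otherwise a sorted copy.
# Safe as long as the caller never mutates the returned vector.
sorted_probs(p::AbstractVector{<:Real}) = issorted(p) ? p : sort(p)
```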