x86-simd-sort

Enable fp16 nonnative support for dynamic dispatch, make more ergonomic for static dispatch

Open sterrettm2 opened this issue 9 months ago • 0 comments

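This patch enables the non-avx512fp16 _Float16 sorting path to be used by the dynamic dispatch logic and integrates it better into the static dispatch logic. It is vastly faster than scalar (geomean time change of -0.9595 in the benchmarks below, roughly 25x faster), but a fair bit slower than the dedicated avx512fp16 code (geomean time change of +0.7601, roughly 76% more time).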
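For readers unfamiliar with the dispatch terminology, here is a minimal sketch of what runtime selection between the three paths could look like. It is not the library's actual dispatch code: `CpuFeatures`, `Fp16Path`, `select_fp16_path`, `resolve_sort` and the three `sort_*` functions are hypothetical names, the feature flags are assumed to be filled in elsewhere (for example from CPUID), and `float` with `std::sort` stands in for `_Float16` and the real kernels so the snippet compiles anywhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical feature flags; a real build would populate these at startup.
struct CpuFeatures {
    bool avx512_fp16;       // native half-precision support
    bool simd16_nonnative;  // assumed: whatever ISA the non-native path needs
};

enum class Fp16Path { NativeFp16, NonNativeSimd, Scalar };

// Prefer the fastest path the CPU supports, in the order the benchmarks suggest.
inline Fp16Path select_fp16_path(const CpuFeatures &cpu)
{
    if (cpu.avx512_fp16) return Fp16Path::NativeFp16;
    if (cpu.simd16_nonnative) return Fp16Path::NonNativeSimd;
    return Fp16Path::Scalar;
}

using SortFn = void (*)(float *, std::size_t);

// Stand-ins for the real kernels; each one just calls std::sort here.
inline void sort_native_fp16(float *a, std::size_t n) { std::sort(a, a + n); }
inline void sort_nonnative_simd(float *a, std::size_t n) { std::sort(a, a + n); }
inline void sort_scalar(float *a, std::size_t n) { std::sort(a, a + n); }

// Resolve the path once and hand back a function pointer to call afterwards.
inline SortFn resolve_sort(const CpuFeatures &cpu)
{
    SortFn fn = nullptr;
    switch (select_fp16_path(cpu)) {
        case Fp16Path::NativeFp16: fn = sort_native_fp16; break;
        case Fp16Path::NonNativeSimd: fn = sort_nonnative_simd; break;
        case Fp16Path::Scalar: fn = sort_scalar; break;
    }
    assert(fn != nullptr);
    return fn;
}
```

Under static dispatch the same choice would be made at compile time from the enabled target flags rather than from runtime flags; the selection order stays the same.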

Comparison to scalar
Benchmark                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9264         -0.9269          6368           468          6373           466
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9399         -0.9394         13394           804         13401           813
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9412         -0.9410         29552          1737         29560          1745
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9209         -0.9208         75451          5967         75463          5975
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9396         -0.9396        590828         35676        590792         35680
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9524         -0.9524      14506782        689933      14506540        689943
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9616         -0.9616     159229801       6113740     159217432       6113529
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9827         -0.9827    1739044872      30113868    1738990462      30113349
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9864         -0.9864   19075697953     259558909   19074535929     259512766
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9210         -0.9209          5579           441          5582           442
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9388         -0.9382         13171           806         13176           815
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9377         -0.9372         28020          1746         28025          1761
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9328         -0.9326         74876          5029         74879          5045
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9469         -0.9469        585484         31094        585483         31107
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9510         -0.9510      14291316        700636      14290814        700538
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9608         -0.9608     156622769       6146373     156621600       6145706
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9826         -0.9826    1729001307      30128303    1728922542      30127689
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9210         -0.9210        803746         63496        803743         63504
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9989         -0.9989        621695           656        621626           654
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9131         -0.9130        742937         64595        742908         64605
[scalarsort.*_Float16 vs. simdsort.*_Float16]_pvalue          0.0315          0.0315      U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN                                              -0.9595         -0.9595             0             0             0             0

Comparison to AVX512_FP16
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.6382         +0.6557           269           441           268           443
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3411         +0.3500           604           810           605           816
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7145         +0.7192          1014          1739          1017          1748
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.7644         +1.7643          2153          5951          2156          5959
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.8787         +1.8783         12278         35344         12282         35351
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7758         +0.7759        385721        684977        385712        684986
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3700         +0.3698       1368168       1874386       1368250       1874260
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.4479         +0.4479       5137910       7438989       5137603       7438745
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.1682         +0.1652      77774693      90857847      77385351      90173137
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.6533         +0.6521           267           441           268           442
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3211         +0.3280           612           809           613           814
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.5179         +0.5210          1141          1732          1143          1738
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.6678         +1.6679          2200          5868          2202          5876
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.5668         +1.5672         12788         32824         12788         32830
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7579         +0.7581        384179        675344        384134        675354
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.2080         +0.2081       1552580       1875503       1552298       1875373
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.4877         +0.4877       5052821       7516986       5052579       7516805
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.4458         +1.4462         25727         62922         25727         62934
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.5392         +0.5407           445           685           445           686
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.4454         +1.4455         25703         62854         25705         62862
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]_pvalue          0.5075          0.5075      U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN                                                +0.7601         +0.7624             0             0             0             0
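
A note on reading the Time and CPU columns: the rows above are consistent with the usual convention of relative change, (new - old) / old, so negative values mean the new code is faster. The helper names below are made up for illustration; this is only a sketch of the arithmetic for turning those values into speedup factors.

```cpp
#include <cassert>

// Relative change as shown in the Time/CPU columns: (new - old) / old.
inline double relative_change(double old_time, double new_time)
{
    assert(old_time > 0.0);
    return (new_time - old_time) / old_time;
}

// How many times faster "new" is than "old" for a given relative change.
// Values below 1.0 mean "new" is slower.
inline double speedup_factor(double rel)
{
    assert(rel > -1.0);
    return 1.0 / (1.0 + rel);
}
```

Plugging in the two geomeans: `speedup_factor(-0.9595)` is about 24.7, so the SIMD path is roughly 25x faster than scalar, and `speedup_factor(0.7601)` is about 0.57, so the non-native path takes roughly 1.76x the time of the avx512fp16 path.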

sterrettm2, Apr 24 '25 21:04