x86-simd-sort
Enable fp16 nonnative support for dynamic dispatch, make more ergonomic for static dispatch
This patch enables the non-avx512fp16 _Float16 sorting path to be used by the dynamic dispatch logic, and integrates it better into the static dispatch logic. It is vastly faster than scalar (about 96% less time by geomean in the benchmarks below), but a fair bit slower than the dedicated avx512fp16 code (about 76% more time by geomean).
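To illustrate the dispatch order described above, here is a minimal sketch of how a runtime selector might prefer the native path, then the non-native SIMD path, then scalar. Everything in it is hypothetical: `CpuFeatures`, `Fp16Path`, and `select_fp16_path` are illustration-only names, not the library's API, and the sketch does not claim which ISA extensions the non-native path actually requires.

```cpp
#include <cassert>

// Hypothetical feature flags. A real build would fill these from CPUID;
// the exact ISA requirement of the non-native path is an assumption here.
struct CpuFeatures {
    bool avx512_fp16 = false;   // native _Float16 vector instructions
    bool simd_fallback = false; // whatever the non-native path needs
};

// Hypothetical names for the three _Float16 sorting paths.
enum class Fp16Path { NativeAvx512Fp16, SimdFallback, Scalar };

// Pick the fastest path the CPU supports: native first, then the
// non-native SIMD path, and scalar only as a last resort.
constexpr Fp16Path select_fp16_path(const CpuFeatures& cpu) {
    if (cpu.avx512_fp16) return Fp16Path::NativeAvx512Fp16;
    if (cpu.simd_fallback) return Fp16Path::SimdFallback;
    return Fp16Path::Scalar;
}

// Human-readable label, e.g. for logging which path was chosen.
inline const char* path_name(Fp16Path p) {
    switch (p) {
        case Fp16Path::NativeAvx512Fp16: return "avx512fp16";
        case Fp16Path::SimdFallback:     return "simd-nonnative";
        case Fp16Path::Scalar:           return "scalar";
    }
    assert(false && "unhandled Fp16Path");
    return "";
}

// Compile-time checks of the priority order.
static_assert(select_fp16_path(CpuFeatures{true, true}) == Fp16Path::NativeAvx512Fp16, "");
static_assert(select_fp16_path(CpuFeatures{false, true}) == Fp16Path::SimdFallback, "");
static_assert(select_fp16_path(CpuFeatures{false, false}) == Fp16Path::Scalar, "");
```

Before this patch, a CPU in the middle case would have fallen through to scalar under dynamic dispatch; the point of the change is that the middle branch is now reachable.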
Comparison to scalar
Benchmark Time CPU Time Old Time New CPU Old CPU New
--------------------------------------------------------------------------------------------------------------------------------------------
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9264 -0.9269 6368 468 6373 466
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9399 -0.9394 13394 804 13401 813
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9412 -0.9410 29552 1737 29560 1745
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9209 -0.9208 75451 5967 75463 5975
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9396 -0.9396 590828 35676 590792 35680
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9524 -0.9524 14506782 689933 14506540 689943
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9616 -0.9616 159229801 6113740 159217432 6113529
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9827 -0.9827 1739044872 30113868 1738990462 30113349
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9864 -0.9864 19075697953 259558909 19074535929 259512766
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9210 -0.9209 5579 441 5582 442
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9388 -0.9382 13171 806 13176 815
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9377 -0.9372 28020 1746 28025 1761
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9328 -0.9326 74876 5029 74879 5045
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9469 -0.9469 585484 31094 585483 31107
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9510 -0.9510 14291316 700636 14290814 700538
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9608 -0.9608 156622769 6146373 156621600 6145706
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9826 -0.9826 1729001307 30128303 1728922542 30127689
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9210 -0.9210 803746 63496 803743 63504
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9989 -0.9989 621695 656 621626 654
[scalarsort.*_Float16 vs. simdsort.*_Float16] -0.9131 -0.9130 742937 64595 742908 64605
[scalarsort.*_Float16 vs. simdsort.*_Float16]_pvalue 0.0315 0.0315 U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN -0.9595 -0.9595 0 0 0 0
Comparison to AVX512_FP16
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.6382 +0.6557 269 441 268 443
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.3411 +0.3500 604 810 605 816
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.7145 +0.7192 1014 1739 1017 1748
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +1.7644 +1.7643 2153 5951 2156 5959
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +1.8787 +1.8783 12278 35344 12282 35351
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.7758 +0.7759 385721 684977 385712 684986
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.3700 +0.3698 1368168 1874386 1368250 1874260
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.4479 +0.4479 5137910 7438989 5137603 7438745
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.1682 +0.1652 77774693 90857847 77385351 90173137
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.6533 +0.6521 267 441 268 442
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.3211 +0.3280 612 809 613 814
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.5179 +0.5210 1141 1732 1143 1738
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +1.6678 +1.6679 2200 5868 2202 5876
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +1.5668 +1.5672 12788 32824 12788 32830
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.7579 +0.7581 384179 675344 384134 675354
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.2080 +0.2081 1552580 1875503 1552298 1875373
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.4877 +0.4877 5052821 7516986 5052579 7516805
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +1.4458 +1.4462 25727 62922 25727 62934
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +0.5392 +0.5407 445 685 445 686
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16] +1.4454 +1.4455 25703 62854 25705 62862
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]_pvalue 0.5075 0.5075 U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN +0.7601 +0.7624 0 0 0 0
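For reading the tables above: the Time and CPU columns are relative changes, (new - old) / |old|, so a negative value means the new run is faster. The sketch below converts between those values and a speedup factor. The helper names `relative_change` and `speedup_from_change` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Relative change as shown in the Time / CPU columns:
// (new - old) / |old|. Negative means the new run took less time.
inline double relative_change(double old_time, double new_time) {
    assert(old_time != 0.0);
    return (new_time - old_time) / std::fabs(old_time);
}

// Convert a relative change into a speedup factor old/new.
// A change of -0.5 is a 2x speedup; +1.0 means twice as slow (0.5x).
inline double speedup_from_change(double change) {
    assert(change > -1.0);
    return 1.0 / (1.0 + change);
}
```

Applied to the OVERALL_GEOMEAN rows, -0.9595 corresponds to a speedup of about 24.7x over scalar, while +0.7601 means the non-native path takes about 1.76x as long as the AVX512_FP16 path. Recomputing a row from its rounded Old/New columns (for example 6368 and 468) matches the printed change only to within rounding.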