std.math.big: add assembly versions of some functions and improve division
A bigger pull request this time.
What this PR does:
- add x86_64 assembly implementations for `llaccum` and `llmulLimb` (for both `.add` and `.sub`), with a special case for `ReleaseSmall`. There are also versions of `llmulLimb` for the `.add` operation when the `adx` CPU feature is supported (needed for the `adox` and `adcx` instructions); a rough sketch of the limb loop `llmulLimb` performs is included right after this list
- add generic versions of `llaccum` and `llmulLimb` for other architectures, with slight edits from the previous functions for better codegen (for `llmulLimb`)
- add tests for `llaccum` and `llmulLimb`
- change the algorithm (overall very similar) for the division, and prepare the function for future implementations of better algorithms for larger numbers (coming soonish, hopefully). There is also a special case when the divisor fits in one `Limb`
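To make the list above concrete, here is a minimal sketch of the kind of limb loop `llmulLimb` performs for the `.add` case: multiply a limb slice by a single limb and accumulate into the destination with carry propagation. This is not the PR's code; the name `mulLimbAddSketch` and the exact carry handling are illustrative only, and the real `std.math.big` routine also covers the `.sub` variant mentioned above.

```zig
const std = @import("std");
const Limb = usize; // matches std.math.big.Limb on typical targets
const DoubleLimb = std.meta.Int(.unsigned, 2 * @bitSizeOf(Limb));

/// Illustrative only: acc[i] += a[i] * y with carry propagation,
/// returning the final carry (the `.add` flavour of the operation).
fn mulLimbAddSketch(acc: []Limb, a: []const Limb, y: Limb) Limb {
    std.debug.assert(acc.len >= a.len);
    var carry: Limb = 0;
    for (a, 0..) |limb, i| {
        // Widen to a double limb so the high half of the product is kept.
        const t = @as(DoubleLimb, limb) * y + acc[i] + carry;
        acc[i] = @truncate(t);
        carry = @intCast(t >> @bitSizeOf(Limb));
    }
    return carry;
}

test "mulLimbAddSketch accumulates with carry" {
    var acc = [_]Limb{ 1, 0 };
    const a = [_]Limb{ 3, 0 };
    const carry = mulLimbAddSketch(&acc, &a, 5);
    try std.testing.expectEqual(@as(Limb, 0), carry);
    try std.testing.expectEqual(@as(Limb, 16), acc[0]); // 1 + 3 * 5
    try std.testing.expectEqual(@as(Limb, 0), acc[1]);
}
```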
I hope the addition of assembly versions of these functions will be accepted, as I am not sure about the stance on assembly implementations in the std for functions that can be written without them.
The current way I handle the assembly is suboptimal, and adding assembly for other architectures or other CPU features is cumbersome. It would probably be better to have a separate file for each architecture if more of these versions are added.
Also, my implementations (at least for `llmulLimb`) can still be improved.
Speed comparison on my machine:
Based on some quick tests for various sizes of randomly generated arrays of `Limb` (from 500 to 10000 limbs):
- the new `llaccum` is ~5.5 times faster than the previous one for addition and ~4.3 times faster for subtraction
- the new `llmulLimb` is ~2.5 times faster than the previous one (with adx), while it is ~2.2 times faster for both addition (without adx) and subtraction
The new generic version of `llmulLimb` is faster than the previous one but slower than the assembly one (between 30% and 60% slower, depending on the operation and whether adx is supported).
I did not manage to get better codegen for `llaccum`, which is partly why I resorted to assembly.
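For context on why the codegen is hard here: a generic carry-propagating accumulate, which is roughly what `llaccum` does for the `.add` case, ends up with explicit carry bookkeeping like the sketch below, and compilers frequently fail to fuse it into the adc chain that the hand-written assembly uses. This is an illustrative sketch only, not the PR's generic implementation.

```zig
const std = @import("std");
const Limb = usize;

/// Illustrative only: acc[i] += a[i] with carry propagation, returning the
/// final carry. The explicit carry bookkeeping below is what the x86_64
/// assembly replaces with an adc chain.
fn accumAddSketch(acc: []Limb, a: []const Limb) u1 {
    std.debug.assert(acc.len >= a.len);
    var carry: u1 = 0;
    for (a, 0..) |limb, i| {
        const s1 = @addWithOverflow(acc[i], limb);
        const s2 = @addWithOverflow(s1[0], @as(Limb, carry));
        acc[i] = s2[0];
        // At most one of the two additions can overflow, so OR-ing is enough.
        carry = s1[1] | s2[1];
    }
    return carry;
}

test "accumAddSketch carries across limbs" {
    var acc = [_]Limb{ std.math.maxInt(Limb), 0 };
    const a = [_]Limb{ 1, 0 };
    try std.testing.expectEqual(@as(u1, 0), accumAddSketch(&acc, &a));
    try std.testing.expectEqual(@as(Limb, 0), acc[0]);
    try std.testing.expectEqual(@as(Limb, 1), acc[1]);
}
```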
The division part
Note: a "2n by n division" means a division of a bigint with 2n limbs by a bigint of n limbs. Likewise, an "n by 1 division" means a division of a bigint with n limbs by a bigint of 1 limb. Also, "gmp (basecase division only)" is a gmp build where I only enabled the basecase algorithm. Once numbers get large enough, gmp switches to better division algorithms, so it is not a fair comparison, though I left default gmp in the graphs as a reference.
Division performance for 2n by n divisions is only improved by the new `llmulLimb` (leading to the same curve on the graph). The new division is ~2.2 times faster than the previous one, while being about 34% slower than gmp (basecase division only). Based on a few tests, this gap comes from gmp's better implementation of `llmulLimb` (which makes sense, since the algorithm used is otherwise near identical).
For n by 1 divisions, however, the new implementation is vastly superior: it is ~6 times faster than the previous one, and about 32% slower than gmp. I got very similar graphs for n by 2 and n by 3 divisions (and haven't benched beyond). I don't know why the previous algorithm underperforms that much in these cases. I also haven't benched the now removed `lldiv0p5`, but the new implementation no longer works with half limbs and is probably faster.
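For reference, the single-limb special case reduces to a schoolbook pass over the limbs with a double-width intermediate, roughly like the sketch below. This is only an illustration of the idea, not the PR's code; the name `divSingleLimbSketch` and the layout are mine.

```zig
const std = @import("std");
const Limb = usize;
const DoubleLimb = std.meta.Int(.unsigned, 2 * @bitSizeOf(Limb));

/// Illustrative only: divide the little-endian limb slice `a` by a single
/// non-zero limb `d`, writing the quotient into `q` (same length as `a`)
/// and returning the remainder.
fn divSingleLimbSketch(q: []Limb, a: []const Limb, d: Limb) Limb {
    std.debug.assert(d != 0);
    std.debug.assert(q.len == a.len);
    var rem: Limb = 0;
    var i: usize = a.len;
    while (i > 0) {
        i -= 1;
        // Combine the running remainder with the next limb, most significant first.
        const num = (@as(DoubleLimb, rem) << @bitSizeOf(Limb)) | a[i];
        q[i] = @intCast(num / d);
        rem = @intCast(num % d);
    }
    return rem;
}

test "divSingleLimbSketch splits across limbs" {
    const a = [_]Limb{ 7, 1 }; // 1 * 2^bits + 7, little endian
    var q = [_]Limb{ 0, 0 };
    const rem = divSingleLimbSketch(&q, &a, 2);
    try std.testing.expectEqual(@as(Limb, 1), rem);
    try std.testing.expectEqual((@as(Limb, 1) << (@bitSizeOf(Limb) - 1)) + 3, q[0]);
    try std.testing.expectEqual(@as(Limb, 0), q[1]);
}
```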
Benching script here.
Should I start pinging jacobly each time I do a PR related to bigints? I have at least 2 of them in preparation.
I have a better division algorithm ready. I initially planned to put it in a new PR, but I can also add it to this one. What should I do?
I'm adding this new algorithm to the PR, since it is actually quite small.
The new algorithm uses a divide-and-conquer method, allowing it to leverage faster multiplication algorithms and therefore be faster for larger numbers.
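For intuition (my summary, not text from the PR): dividing a 2n-limb number by an n-limb number with divide and conquer reduces to two half-size divisions plus a constant number of full-width multiplications, so the cost is dominated by the multiplication cost $M(n)$:

$$D(n) \approx 2\,D(n/2) + O(M(n)) \quad\Longrightarrow\quad D(n) = O(M(n)\log n),$$

which is why the recursive algorithm benefits from faster multiplication and pulls ahead of the quadratic basecase division once operands get large enough.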
In the graph, the new algorithm is `default_div (new)`, the one from this PR before the previous commit is `default_div (prev)`, and the one currently in the std is `default_div (old)`. The graph on the right is made from the same data, but only with the new algorithm and gmp.
Here is also an unbalanced division (5n by n instead of 2n by n):
~~On a side note, I don't think any test can trigger the new code, since it requires a divisor of at least `recursive_division_threshold` limbs, which is currently 100.~~ A test has been added in the latest commit.
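For anyone wanting to reproduce that locally, a test along these lines exercises the large-divisor path through the public bigint API. This is a hypothetical sketch, not the test added in the PR; the `Managed` methods used here (`shiftLeft`, `divFloor`, `order`, `eqlZero`) are from the public API as I remember it and their signatures may differ between Zig versions.

```zig
const std = @import("std");
const Managed = std.math.big.int.Managed;

test "divisor large enough to take the recursive division path" {
    const allocator = std.testing.allocator;

    var one = try Managed.init(allocator);
    defer one.deinit();
    try one.set(1);

    var a = try Managed.init(allocator);
    defer a.deinit();
    var d = try Managed.init(allocator);
    defer d.deinit();
    var q = try Managed.init(allocator);
    defer q.deinit();
    var r = try Managed.init(allocator);
    defer r.deinit();

    // a = 2^(260 * 64) and d = 2^(130 * 64): on a 64-bit target the divisor
    // spans about 130 limbs, well above the threshold of 100, and the
    // expected result is q == d with r == 0.
    try a.shiftLeft(&one, 260 * 64);
    try d.shiftLeft(&one, 130 * 64);

    try q.divFloor(&r, &a, &d);

    try std.testing.expect(q.order(d) == .eq);
    try std.testing.expect(r.eqlZero());
}
```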
A review of this PR would be appreciated.
This pull request is not ready for review because:
- It has conflicts that must be resolved via rebasing against latest origin/master.
- It is not passing the CI tests.
Closing since there has been no activity for 30+ days.