zig icon indicating copy to clipboard operation
zig copied to clipboard

std.math.big: Implementation of assembly version of some functions and division improvement

Open samy-00007 opened this issue 9 months ago • 4 comments

A bigger pull request this time.

What this PR does:

  • add x86_64 assembly implementations for llaccum and llmulLimb (for both .add and .sub), with a special case for ReleaseSmall. There are also versions of llmulLimb for the .add operation if the adx cpu feature is supported (needed for adox and adcx instructions)
  • add generic versions of llaccum and llmulLimb for other architectures, with slight edits from the previous function to have a better codegen (for llmulLimb)
  • add tests for llaccum and llmulLimb
  • change the algorithm (overall very similar) for the division, and prepare the function for future implementations of better algorithms for larger numbers (coming soonish, hopefully). There is also a special case when the divisor fits in one Limb

I hope the addition of assembly version of these functions will be accepted, as I am not sure about the stance on assembly implementations in the std for functions that can be written without. The current way I handle the assembly is suboptimal, and adding assembly for other architectures or other cpu features is cumbersome. It would probably be better to have a separate file for each architecture whenever these versions are added. Also, my implementations (at least for llmulLimb) can still be improved.

Speed comparison on my machine:

Based on some quick tests for various size of randomly generated arrays of Limb (from 500 to 10000):

  • the new llaccum is ~5.5 times faster than the previous one for addition and ~4.3 times faster for subtraction
  • the new llmulLimb is ~2.5 times faster than the previous one (with adx) while it is ~2.2 times faster for both addition (without adx) and subtraction.

The new generic version of llmulLimb is faster than the previous one but slower than the assembly one (between 30 to 60% slower depending on the operation and whether adx is supported) I did not manage to get better codegen for llaccum, which is partly why I resorted to assembly.

The division part

note: a "2n by n division" mean a division of a bigint with 2n limbs by a bigint of n limbs. Likewise, a "n by 1 division" mean a division of a bigint with n limbs by a bigint of 1 limb. Also, "gmp (basecase division only)" is a gmp version where I only enabled the base algorithm. Once numbers get large enough, gmp switches to better division algorithms, so it is not a fair comparison, though I left default gmp in the graphs as a reference.

Division performance for 2n by n division is only improved by the new llmulLimb (leading to the same curve on the graph). The new division is ~2.2 times faster than the previous one, while being about 34% slower than gmp (basecase division only). Based on a few tests, this is due to gmp's better implementation of llmulLimb (which makes sense, since the algorithm used is otherwise near identical).

For n by 1 division, however, the new implementation is vastly superior, as it is ~6 times faster than the previous one, and about 32% slower than gmp. I got very similar graphs for n by 2 and n by 3 divisions (and haven't benched beyond). I don't know why the previous algorithm underperforms that much in these cases. Also, I haven't benched the now removed lldiv0p5, but the new implementation now doesn't care about half limbs and is probably faster.

graph

Benching script here.

samy-00007 avatar May 10 '25 16:05 samy-00007

Should I start pinging jacobly each time I do a PR related to bigints ? I have at least 2 of them in preparation.

samy-00007 avatar May 11 '25 19:05 samy-00007

I have a better division algorithm ready. I initially planned to put it in a new PR, but I can also add it in this one. What should I do ?

samy-00007 avatar May 13 '25 21:05 samy-00007

I'm actually adding this new algorithm to the PR, since it is actually quite small.

The new algorithm uses a divide and conquer method, allowing it to leverage faster multiplication algorithms, and therefore being faster for larger numbers.

In the graph, the new algorithm is default_div (new), the one from this PR before the previous commit is default_div (prev) and the one currently in the std is default_div (old). The graph on the right is made with the same data, but only with the new algorithm and gmp. 2n by n division

Here is also an unbalanced division (5n by n instead of 2n by n): 5n by n division

~On a side note, I don't think any test can trigger the new code since it requires a divisor of at least recursive_division_threshold, which is currently 100.~ A test has been added in the latest commit.

samy-00007 avatar May 16 '25 19:05 samy-00007

A review of this PR would be appreciated.

samy-00007 avatar May 25 '25 07:05 samy-00007

This pull request is not ready for review because:

  • It has conflicts that must be resolved via rebasing against latest origin/master.
  • It is not passing the CI tests.

Closing since there has been no activity for 30+ days.

andrewrk avatar Nov 27 '25 02:11 andrewrk