Speed up the `map()` and `forEach()` functions in DenseMatrix.js by over 300%
I optimized the map() and forEach() functions in DenseMatrix.js to make them run over 300% faster!
There are multiple optimization tricks I used in conjunction to achieve this:
- Using an iterative approach instead of an inefficient recursive approach by using an index marching algorithm
- Duplicating the matrix into the result matrix before applying the callback function to each of the elements
- Not using the helper
.get()and.set()functions - Inlining the index marcher
- Keeping a reference to each dimension of the matrix for faster access to the matrix's elements
- Using a simple for loop when accessing the last dimension of the matrix to reduce index marches. This also means that there is no more need to store a reference to the last dimension of the matrix.
- Using a
Uint32Array()for the index instead of a regularList
Credits and many thanks to @RandomGamingDev for coming up with the algorithms used for this optimization!
This PR was created in response to this: https://github.com/josdejong/mathjs/pull/3149#pullrequestreview-1885544599
I didn't find any performance issues with most functions or operators. I noticed though that the .map() method of DenseMatrix is relatively slow. This is a off topic I think, we should address that in a separate PR (if there is room for improvement). In any case, I would like to proceed with this PR.
Thanks Jmar, that sounds awesome! I'll look into your PR next week.
I've added a benchmark to test the performance of the map and forEach method, see https://github.com/josdejong/mathjs/commit/88a4b35e9d7bbd4bc2e2bdaba5c20d3da702d870. Running that I see that the map method is about 15 times faster which is incredible! The forEach method is about 1.5 times faster. This looks really promising!
A number of feedbacks:
- I prefer keeping the API the same: passing a plain and simple
Arrayasindex. It would be a breaking change, and I think it would be a bit confusing sinceUint32Arrayis not a commonly used type. So I prefer replacing theUint32Arrayarray with a regularArrayagain. I'm curious to see if that deteriorates the performance a lot again (I expect not). - The
indexwas immutable, and in this PR it is mutated every time. This is great for performance, but can cause weird errors when storing these indexes for use afterwards. I prefer the API to be "worry free" by keeping the indices immutable. This means we need to create a new array every time we call the callback. Can you give that a try and see how much impact that has on the performance? Also: we'll need to add unit tests that verify that the index is not mutated. - The code in
mapandforEachis duplicated. Can we do something about that and reuse the logic?
I just realize there is another PR putting effort in improving map and forEach: #3256. Let's make sure to align/unify the efforts.
Hi @josdejong, I used the benchmark that you've added and was surprised to see that changing it back to an Array became faster than using a Uint32Array. That result combined with the benefit of keeping the API the same made it the better option.
Array vs Uint32Array -- Array is ~231% faster in forEach() and ~101% faster in map()
Although in a separate benchmark done by @RandomGamingDev, he made a simple stress test that showed Uint32Arrays significantly outperforming simple Arrays: https://editor.p5js.org/PotatoBoy/sketches/nU-dJfVKN
Next, I changed way I cloned the index array, going from using [...index] to index.slice(0) which was significantly faster.
[...index] vs index.slice(0) -- Using index.slice(0) is ~110% faster in map() and ~123% faster in forEach()
Finally, I tested to see how passing a immutable index array to the callback function in map() would affect its performance. I found that cloning the index array deteriorated the speed by about HALF of its mutable counterpart. I only tested this in map() because using a mutable index in forEach() results in errors when running npm test.
Mutable index vs Cloned index using index.slice(0) (map() only) -- Using mutable index is ~161% faster
As for dealing with the duplicated logic, I am currently experimenting with different ways of refactoring the code in ways that do not significantly deteriorate the performance. Simply calling forEach() inside map() is not feasible because it would require using the matrix's helper .set() function which is slow because it does not use the optimization trick of keeping track of each dimension of the matrix for faster access to each element.
Performance impact of naively using .set() inside map() to resolve duplicated logic -- Naive method is ~455% slower
I just realize there is another PR putting effort in improving
mapandforEach: #3256. Let's make sure to align/unify the efforts.
Hi Jos and Jmar, for #3256 I intentionally left out any modifications on the recurse function that might affect speed due to this PR. The only significant change is that I moved the recurse function to the src/utils/array.js method so that it would be taken when needed by other functions (map, forEach, dense.map, dense.forEach), taking into account Jos's comments on this PR.
The main speed benefit in #3256 is the reduced callback functions and the rest of the changes that I have planned is just to reduce repetition of code on the transform functions by making a utility that transforms callbacks and multiple callbacks.
For this PR my recommendation would be to also take into account speed improvements in arrays like in map(array, callback) and forEach(array, callback). At the moment I'm not sure if it's possible to include these benefits also in the multiple callback territory like map(array1, array2, callback(v1, v2)).
Awesome work on the speed improvements on this PR!
Thanks for the updates @Galm007 , this sounds promising! It makes sense that cloning of the index causes a performance hit, but I think it is necessary, and I'm glad to see it's not that bad.
@dvd101x indeed JMar's PR (this one, #3251) and yours (#3256) tackle different performance bottlenecks on the same two methods, which is great. I expect some merge conflicts though, so good to be aware of the two initiatives😅.
@Galm007 I've just merged #3256 containing performance improvements in map, filter, and forEach. This gives some merge conflicts. Please let me know if you need help fixing the conflicts.
There are now also a new new benchmarks in test/benchmark/map.js and test/benchmark/forEach.js that you can use to test the improvements that you're working on in this PR, that may be helpful.
Hi @josdejong :) I've ultimately decided to simply put the duplicated logic in a function named _forEach() which is used by the map() and forEach() functions. Since javascript does not pass variables by reference, I made _forEach()'s callback return an array and index. This one indirection has negligible performance on the overall impact and fixes the issue of duplicated logic. Maybe this can also open up the possibility of replacing the helper recurse() function.
@Galm007 any chance to look at my feedback? Your PR is almost finished, would be great to have it merged :)
Hi @josdejong, thanks for being patient with me, there were personal circumstances that popped up which I had to focus on for the moment. I resolved the feedback suggestions you gave me and it is great that the performance gains still stayed significantly high even after changing the _forEach() function to using immutable index arrays. Have an awesome day!
Thanks for the updates Jmar. The delay is no problem at all, please take care of personal life first!
I ran the /test/benchmark/map.js test before and after this PR, the performance improvement is huge! 😎
# Before
abs(genericMatrix) x 51,779 ops/sec ±0.40% (97 runs sampled)
abs(array) x 586,908 ops/sec ±0.51% (95 runs sampled)
abs(numberMatrix) x 47,169 ops/sec ±0.95% (95 runs sampled)
genericMatrix.map(abs) x 40,415 ops/sec ±0.69% (94 runs sampled)
numberMatrix.map(abs) x 38,059 ops/sec ±0.48% (94 runs sampled)
map(genericMatrix, abs) x 40,118 ops/sec ±0.47% (97 runs sampled)
map(numberMatrix, abs) x 37,813 ops/sec ±0.49% (95 runs sampled)
map(array, abs) x 45,788 ops/sec ±0.89% (93 runs sampled)
map(array, abs.signatures.number) x 67,965 ops/sec ±0.48% (95 runs sampled)
genericMatrix.map(abs.signatures.number) x 55,406 ops/sec ±0.65% (96 runs sampled)
numberMatrix.map(abs.signatures.number) x 51,242 ops/sec ±0.47% (95 runs sampled)
# After
abs(genericMatrix) x 349,299 ops/sec ±0.68% (91 runs sampled)
abs(array) x 651,351 ops/sec ±0.68% (96 runs sampled)
abs(numberMatrix) x 350,675 ops/sec ±0.64% (93 runs sampled)
genericMatrix.map(abs) x 153,168 ops/sec ±0.54% (95 runs sampled)
numberMatrix.map(abs) x 151,669 ops/sec ±0.80% (94 runs sampled)
map(genericMatrix, abs) x 150,948 ops/sec ±0.65% (92 runs sampled)
map(numberMatrix, abs) x 147,245 ops/sec ±0.52% (97 runs sampled)
map(array, abs) x 49,053 ops/sec ±0.68% (94 runs sampled)
map(array, abs.signatures.number) x 71,779 ops/sec ±0.52% (94 runs sampled)
genericMatrix.map(abs.signatures.number) x 402,618 ops/sec ±0.48% (94 runs sampled)
numberMatrix.map(abs.signatures.number) x 400,955 ops/sec ±0.63% (90 runs sampled)
Merging the PR now.
Published now in v13.2.0, thanks again!