[SYSTEMDS-3544] Boolean Matrix Support
Thanks @louislepage, for starting the project. Please add the Apache license to the new file, DenseBlockTrueBool.java. Looking forward to seeing how you use this new denseblock for the binary matrix operations.
- added rudimentary support for true boolean operations by checking the result MatrixBLock and using boolean arithmetic accordingly.
- java program that uses the systems jar to run operators and compute metrics can be found here: https://github.com/louislepage/systemds-boolean-matrix-evaluation
First results from25 runs of 10k x 10k sized double matrices show an average speed-up:
Matrix Dimensions: 10000 x 10000
Operator: GreaterThan
Runtime statistics in ms:
FP64:
LongSummaryStatistics{count=25, sum=60166, min=1712, average=2406,640000, max=3633}
BOOLEAN:
LongSummaryStatistics{count=25, sum=54211, min=1742, average=2168,440000, max=2521}
TRUE_BOOLEAN with boolean arithmetics:
LongSummaryStatistics{count=25, sum=33950, min=1009, average=1358,000000, max=1931}
Note, that the BOOLEAN DenseBlock uses Bitset, but simple boolean arrays as in the TRUE_BOOLEAN case, seem to perform better, at least regarding runtime.
Further Steps
- more tests with different comparison operators
- test memory efficiency
- improve implementation of differentiation between boolean arithmetic and numeric/double arithmetics
Thanks for the commits @louislepage. I have a few questions regarding the experiment setup. How are you running and benchmarking the experiments? Are you running a dml script with time()? How did you generate the input matrices? Are they dense or sparse?
Thanks for the feedback @phaniarnab !
Since the boolean denseblocks are in an early stage and not yet usable via DML, I wrote a simple java program that creates the matrices and runs the operators on them. It runs the operators on two randomly generated double matrices of a predefined size. In my example, I took the average of 25 computations with random matrices. The time was taken using javas Instant.now() and then calculated as milli seconds, which were stored in a LongSummaryStatistics object to compute averages etc.:
Instant start = Instant.now();
mb1.binaryOperations(op, mb2, ret);
Instant finish = Instant.now();
long timeElapsed = Duration.between(start, finish).toMillis();
timeStatistics.accept(timeElapsed);
As described here, the tool can be found in this repo. It should be public.
The main reason for this approach is that I did not yet want to add full support for boolean matrices since I can not tell if that might break things in other places. However, I think it should be fine to use boolean matrices whenever running unsafeBinary with operators that are Comparision functions. If the data of dense blocks are only accessed using the get() functions, we can always return the double versions of the boolean values there to maintain compatibility with the FP64 implementation. To access the values as boolean in the future, I added a getBoolean() function to the boolean denseblocks.
Let me know if I should add this support for boolean matrices by using translations to doubles in the get() method to maintain compatibility.
But first I should test which kind of boolean implementation performs better, boolean array or bitset.
Measuring time in Java is fine. However, I recommend using System.nanoTime() instead.
Use boolean/bitset inputs of different sizes.
New file, new missing license, sorry... Will fix asap!
Experiment:
To analyze and compare the performance of DenseBlockBool as Bitset and as BooleanArray to the current FP64 implementation, a java program was created that uses the classes from systemds (systemds-boolean-matrix-evaluation). For each given matrix size, it does five computations with pseudo-randomly generated matrices for double to boolean operators and for boolean to boolean operators. Both the total runtime of each operation and the memory used are stored. Finally, the average over the five runs is taken. The Matrix sizes are chosen to represent up to 16mio total entries, but with different dims. We used square matrices of up to 4000x4000 and non-square ones of up to 100x160000. In the non-square case, one dimension was always fixed to a length of 100.
Evaluation of Double to Bool (GreaterThan)
As input, two dense FP64 Matrices are randomly generated based on a given seed. As the output, a MatrixBlock with a DenseBlock of the desired type (FP64, Bitset, or BooleanArray) is created. The timer is started, the operator called with the two input and the output matrices, the timer is stopped, and the result matrix is returned for validation. Memory is computed as the difference between the memory used before the in- and output matrices are created and after the result is computed (but before any Matrix blocks were dereferenced).
The plot shows the runtimes on the left and memory usage on the right. In blue, the results of the bitset implementation are plotted. In green is the current Double/FP64 implementation, and in orange is the Boolean Array implementation. The different markers represent the results of square (square marker), Nx100 (circle), and 100xN (cross) matrices.
As expected, both runtime and memory usage increase linearly with an increased number of entries in the matrix.
The results show a visible performance increase for computation regarding runtime when using BooleanArray. This is most likely due to faster, direct access to the smaller arrays and fewer translations between boolean results and their double representation to store them in the result matrix since the operators can return boolean values, which can then be directly stored in the result matrices.
Interestingly the boolean array outperforms the Bitset here. The Bitset was even slower than the FP64 implementation. Without further investigation, this is probably due to fewer instructions necessary to access the boolean array compared to more when storing a value in a bitset. Also, it is worth noting that the bitset stores the values internally as long and therefore has to access an 8bytes large long whenever accessing a value. This might influence caching.
Regarding memory, both boolean implementations took less space. As expected, the Bitset takes even less space since each boolean in the boolean arrays still takes up one byte, but the bitset stores the booleans as individual bits that are collected into long.
There is no difference in runtime or memory when using non-square matrices as long as the total number of entires is the same.
Evaluation of Boolen To Boolean (Xor)
As input, two dense Matrices of the given type (FP64, Bitset, or BooleanArray) with boolean values are randomly generated based on a given seed. As the output, a MatrixBlock with a DenseBlock of the desired type is created. The timer is started, and the operator is called with the two input and the output matrices in a loop for 20 iterations. The timer is stopped, and the result matrix is returned for validation. Memory is computed as the difference between the memory used before the in- and output matrices are created and after the loop ended (but before any Matrix blocks were dereferenced).
The plot shows the runtimes on the left and memory usage on the right. In blue, the results of the bitset implementation are plotted. In green is the current Double/FP64 implementation, and in orange is the Boolean Array implementation. The different markers represent the results of square (square marker), Nx100 (circle), and 100xN (cross) matrices.
As expected, an improved runtime for the boolean arrays is also visible in this case and both runtime and memory usage increase linearly with an increased number of entries in the matrix.
We can also see an improvement in memory usage when using a boolean implementation. The memory used is less than when running the Double to Boolean operator, even when using FP64, but this is probably due to reusing the result matrix as input to the next computation. Therefore after the first iteration, there are only two matrices in memory.
Memory usage for the Boolean to Boolean case, when using actual Booleans as input, has decreased significantly since all matrices in memory are small, not only the output matrix. It is also visible that the bitset again outperforms the boolean array regarding memory usage.
There is no difference in runtime or memory when using non-square matrices as long as the total number of entries is the same.
Conclusion:
Since both boolean implementations improve memory usage, but only the boolean array outperforms the FP64 implementation in runtime, I suggest using the boolean array as the DenseBlock for any operation that results in only boolean data.
Let me know if there are any further open questions.
Have you tried pop count on the bit set for the sum operation? This should speed up the bit set significantly for sum.
Have you tried pop count on the bit set for the sum operation? This should speed up the bit set significantly for sum.
I did look into how some operators would probably have a speedup when using bitset because they can be performed on the entire bitset and not value by value.
OR, AND, and XOR are also candidates for this (e.g. BitSet.or(BitSet bitset).
But, it would mean a reimplementation of all of these operators. And a mechanism to decide when to use value by value and when Bitset on Bitset.
However, the GraterThan and other Comparison Functions were already slower for bitset , than when run on FP64, and these have to run value by value. This would mean that one would need a dynamic typing mechanism to decide during runtime which kind of boolean denseblock would be most beneficial for the upcoming operations. I doubt that translating from boolean array to bitset and vice-versa for each operation would be worth it.
@phaniarnab and I discussed this briefly in a meeting but came to the result, that in the end, there should be only one kind of boolean array implementation for now.
Therefore I proposed the boolean array since it seems to have some speedup over all operators while using the same value-by-value operations as the FP64 implementation did. It would provide a speedup while being backward-compatible with the current operator implementation, so to say.
But I agree that there most certainly are situations where bitset could be significantly faster.
I think that having both boolean implementations and deciding at runtime what should be used would be very interesting, but for now, I do not understand systemds well enough to confidently say if this would be feasible and how complex this would be.
Some changes from main were throwing errors for the overflow tests and, following the contributing guidelines, I had to rebase on the main again and ran into conflicts. I resolved them but had to force-push this branch to accommodate these changes.
Can you restart the pipeline once? I think a couple of errors are due to issues with spark and caching and might not be directly code related.
I cannot replicate the errors locally and do not understand how the changes in this branch could have changed anything regarding privacy handling or spark execution.
If you do see an issue with tre test results that might be code dependent, I would be very grateful for any tips. Thanks!
Can you restart the pipeline once? I think a couple of errors are due to issues with spark and caching and might not be directly code related.
I cannot replicate the errors locally and do not understand how the changes in this branch could have changed anything regarding privacy handling or spark execution.
If you do see an issue with tre test results that might be code dependent, I would be very grateful for any tips. Thanks!
Yes. They may not be related to your changes.
Thanks, now it was successful! And with this, I do not have anything else to add to the project. Let me know if you need further information on the code and experiments.
Anyways, I still see the same
org.apache.sysds.runtime.controlprogram.federated.FederatedWorkerHandlerException: Failed to getVariable
and
Exception : class io.netty.channel.AbstractChannel$AnnotatedConnectException
Message : Connection refused: localhost/127.0.0.1:35013
LEVEL : 3
Exception : class java.net.ConnectException
Message : Connection refused
I do see the same errors in other successful jobs: https://github.com/apache/systemds/actions/runs/5478347246/jobs/9978929681?pr=1850
Now I am interested in how these tests are designed... 😅
Thanks @louislepage. The plots conclude this task. Thanks for your contribution. 👍🏽 If there are any scripts that are needed to reproduce the plots, please share those with us.
Great, thanks for your mentoring!
The jupyter notebook and result CSVs with which I created the plots can be found under louislepage/systemds-boolean-matrix-evaluation/python-eval in the repo where you can also find the java code that I used to gather the data.
This feature needs more thought and testing before it is ready for merging. I will leave it for the next release. @j143
Excelente