Added SVE intrinsics for the postGemmPart function
This pull request adds SVE-based implementations of the postGemmPart function for both float and double types to accelerate vectorized computation on ARM.
Average performance (on Graviton3):
- Float: ~4.3× speedup over scalar
- Double: ~1.19× speedup over scalar
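For illustration, a minimal sketch of the kind of predicated SVE loop this change introduces (names and structure are hypothetical, not the actual oneDAL code): the float path scales the GEMM output and applies exp, with the loop tail handled by the predicate. SVE has no exp instruction, so the sketch falls back to scalar `std::exp`; a production kernel would vectorize that step as well (e.g. polynomial plus FEXPA). The double path is analogous with `svcntd`/`svwhilelt_b64`.

```cpp
#include <arm_sve.h>
#include <cmath>
#include <cstdint>

// Hypothetical sketch (not the actual oneDAL postGemmPart): compute
// y[i] = exp(coeff * x[i]) with a predicated SVE loop for the scaling step.
void postGemmPartSketch(const float * x, float * y, std::int64_t n, float coeff)
{
    const svfloat32_t vcoeff = svdup_n_f32(coeff);
    for (std::int64_t i = 0; i < n; i += static_cast<std::int64_t>(svcntw()))
    {
        // The predicate masks off the tail lanes on the final iteration.
        const svbool_t pg   = svwhilelt_b32(i, n);
        const svfloat32_t v = svmul_f32_x(pg, svld1_f32(pg, x + i), vcoeff);
        svst1_f32(pg, y + i, v);
    }
    // SVE has no exp instruction; a scalar pass keeps this sketch self-contained.
    for (std::int64_t i = 0; i < n; ++i) y[i] = std::exp(y[i]);
}
```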
The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are closed. This approach ensures that reviewers don't spend extra time asking for standard requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way. For example, a docs-only PR doesn't require the performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
- [ ] I have reviewed my changes thoroughly before submitting this pull request.
- [ ] I have commented my code, particularly in hard-to-understand areas.
- [ ] I have updated the documentation to reflect the changes, or created a separate PR with the update and provided its number in the description, if necessary.
- [ ] Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
- [ ] I have added the respective label(s) to the PR if I have permission to do so.
- [ ] I have resolved any merge conflicts that might occur with the base branch.
Testing
- [ ] I have run it locally and tested the changes extensively.
- [ ] All CI jobs are green or I have provided justification why they aren't.
- [ ] I have extended the testing suite if new functionality was introduced in this PR.
Performance
- [ ] I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with the measured data, if a performance change is expected.
- [ ] I have provided justification why performance has changed or why changes are not expected.
- [ ] I have provided justification why quality metrics have changed or why changes are not expected.
- [ ] I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.
@shubhamsvc I haven't checked all the places where this is used, but I think the 'exp' function as used in RBF kernels requires high-precision computations - otherwise the quality of the results might degrade significantly in some situations (particularly with Newton-type methods that rely on those computations).
The exp function from MKL that gets called on x86 is invoked in 'high accuracy' mode, as you can see here: https://github.com/uxlfoundation/oneDAL/blob/13c979bb596a4fe06864627e6456fc63fbbc04f5/cpp/daal/src/externals/service_math_mkl.h#L126
Could you provide some information about the accuracy level of the 'exp' function here? Is there some reference paper analyzing the method?
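For context (the standard definition, not something specific to this PR), the RBF kernel that consumes this exp is:

```latex
K(x_i, x_j) = \exp\left(-\gamma \,\lVert x_i - x_j \rVert^2\right),
\qquad
\lVert x_i - x_j \rVert^2 = \lVert x_i \rVert^2 - 2\, x_i^\top x_j + \lVert x_j \rVert^2
```

The expansion on the right is presumably why a GEMM produces the cross terms and a post-GEMM step (postGemmPart) applies the exp, which is exactly where the accuracy of the vectorized exp matters.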
@Vika-F Could you provide some info about which algorithms use this function?
@david-cortes-intel @shubhamsvc
Besides RBF, this function is used in GBT at the prediction stage, but the accuracy requirements there are lower, I guess.
vExp is also used in the EM GMM, AdaBoost and LogitBoost algorithms, but those are not part of sklearn-intelex, so their importance is lower.
But to clarify: the PR is not modifying the 'vExp' function in the '_ref' service file; it only modifies this particular file (the RBF kernel).
I understand RBF is used in SVMs (not sure whether the algorithm there degrades with lower-precision exp), and it might be used in the future in spectral clustering (where lower exp precision shouldn't be much of an issue), but is there some other place where these RBF kernels might be called?
@david-cortes-intel Sorry, there was a misunderstanding. No, RBF is not currently used anywhere in oneDAL except SVM.
But potentially it can be used in any algorithm that can benefit from kernels (we only have SVM for now). Sklearn has kernel ridge regression, for example, and maybe something else.
@shubhamsvc And I also think it would be beneficial to have this exp code in vExp or another similar primitive (in case its accuracy is better than 0.5 ULP) so that it can be reused in other algorithms. The code of the corresponding vExp is located here: https://github.com/uxlfoundation/oneDAL/blob/main/cpp/daal/src/externals/service_math_ref.h#L224
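For illustration, a hedged sketch of how an SVE exp kernel could sit behind a vExp-style primitive. The signature is assumed to resemble the ref service math (the actual one in service_math_ref.h may differ), and `sveExp` is a hypothetical kernel like the one in this PR:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical SVE exp kernel (e.g. the one from this PR), declared here only
// so the dispatch sketch compiles; its accuracy should be documented in ULPs.
template <typename fpType>
void sveExp(std::int64_t n, const fpType * in, fpType * out);

// Assumed vExp-style entry point; the real oneDAL signature may differ.
template <typename fpType>
void vExpSketch(std::int64_t n, const fpType * in, fpType * out)
{
#if defined(__ARM_FEATURE_SVE)
    sveExp(n, in, out); // vector path, reusable by RBF, GBT, logistic regression, ...
#else
    for (std::int64_t i = 0; i < n; ++i) out[i] = std::exp(in[i]); // scalar reference
#endif
}
```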
@david-cortes-intel @Vika-F Thank you for the quick response. What ULP accuracy is expected for exp(x) in this case?
For the general vExp function, which is also used in logistic regression, I don't think it'd be advantageous to use something with more than 1 ULP of error even if it were much faster, as numerical inaccuracies there have a noticeable effect on convergence speed and the quality of results.
For the SVM RBF kernel specifically, I am not sure - perhaps someone more familiar with the underlying algorithm could comment. Nevertheless, it would still be ideal to know the accuracy level of this 'exp' function, at the very least to record it as a comment in the code.
Perhaps one potential next step could be to run RBF SVM tests with the sklbench repository (which requires sklearnex built against this oneDAL branch) before and after this PR and see how the quality metrics change. @Alexsandruss Could you provide instructions for running the SVM RBF cases from sklbench? @rakshithgb-fujitsu Could you perhaps try the changes before/after this PR with SVM RBF kernel examples on ARM hardware?
To underline the importance of RBF: it is already exposed in sklearnex in the onedal module (https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/onedal/primitives/kernel_functions.py#L90) and could easily become a publicly usable sklearnex function in short order (replicating sklearn functionality: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html).
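To make the accuracy question above concrete, a minimal sketch (my own, not part of the PR) for estimating the ULP error of a candidate exp against a higher-precision reference. For positive finite doubles the IEEE-754 bit patterns are monotonic, so their integer distance counts whole ULPs:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// ULP distance between two positive finite doubles via their bit patterns.
static std::int64_t ulpDistance(double a, double b)
{
    std::int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);
    std::memcpy(&ib, &b, sizeof b);
    return ia > ib ? ia - ib : ib - ia;
}

int main()
{
    std::int64_t worst = 0;
    for (double x = -30.0; x <= 30.0; x += 1e-3)
    {
        // Reference: exp in long double precision, rounded back to double.
        const double ref  = static_cast<double>(std::exp(static_cast<long double>(x)));
        const double test = std::exp(x); // replace with the candidate SVE exp
        const std::int64_t d = ulpDistance(ref, test);
        if (d > worst) worst = d;
    }
    std::printf("max observed error: %lld ULP\n", static_cast<long long>(worst));
    return 0;
}
```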
I added the PR checklist to guide reviewing/development.
Yes, Shubham is from Fujitsu as well. We'll share all the benchmark details and validate it on ARM hardware.
@david-cortes-intel The current SVE implementation of exp had low accuracy, so it has been removed in this PR. I am working on improving the ULP accuracy of exp and will include an updated version in a subsequent PR.
CI failures do not look related to this PR, but @shubhamsvc please lint the files according to the instructions and merge the main branch here.