oneMKL icon indicating copy to clipboard operation
oneMKL copied to clipboard

Adding rocBLAS support for BLAS domain through Intel DPCPP compiler with HIP backend support

Open TejaX-Alaghari opened this issue 3 years ago • 11 comments

Description

This PR is intended to extend the HIP backend of BLAS domain, which is currently supported for hipSYCL compiler to Intel’s DPC++/SYCL compiler.

Checklist

All Submissions

  • [x] Do all unit tests pass locally? Test Cases.log
  • [x] Have you formatted the code using clang-format?

New features

  • [x] Have you provided motivation for adding a new feature?
  • [x] Have you added relevant tests?

TejaX-Alaghari avatar Apr 25 '22 15:04 TejaX-Alaghari

@TejaX-Alaghari Thank you for this PR! I have built and successfully tested it on an application containing SYCL code. Just had to adjust the following line in FindrocBLAS.cmake to IMPORTED_LOCATION "${HIP_PATH}/../rocblas/lib/librocblas.so" in order to handle non-standard ROCM paths. Could you please incorporate this small fix before the PR gets merged? Also filed a related issue on the intel llvm page.

ajaypanyala avatar May 03 '22 03:05 ajaypanyala

Thanks for the suggestion @ajaypanyala. I've added a commit for that change. Do let me know if there are any other suggestions.

TejaX-Alaghari avatar May 05 '22 16:05 TejaX-Alaghari

@TejaX-Alaghari @mmeterel I get a link error in the final step.

[100%] Linking CXX shared library ../../../../lib/libonemkl_blas_rocblas.so
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: cannot find -lclang_rt.builtins-x86_64

Not sure if something needs to be fixed in this file.

For now, I got it to work by manually adding -L$ROCM_PATH/llvm/lib/clang/13.0.0/lib/linux to the link line where libclang_rt.builtins-x86_64.a exists. The link step tries to find clang_rt.builtins-x86_64 in the folder $INTEL_LLVM_INSTALL_PATH/lib/clang/15.0.0/lib/linux which doesn't exist. Only the include folder $INTEL_LLVM_INSTALL_PATH/lib/clang/15.0.0/include exists. I am building the intel-llvm compiler as described here

[EDIT] This issue probably belongs in the intel-llvm repo. There is a related issue that was just opened.

ajaypanyala avatar May 12 '22 02:05 ajaypanyala

@TejaX-Alaghari @mmeterel I get a link error in the final step.

[100%] Linking CXX shared library ../../../../lib/libonemkl_blas_rocblas.so
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: cannot find -lclang_rt.builtins-x86_64

Not sure if something needs to be fixed in this file.

For now, I got it to work by manually adding -L$ROCM_PATH/llvm/lib/clang/13.0.0/lib/linux to the link line where libclang_rt.builtins-x86_64.a exists. The link step tries to find clang_rt.builtins-x86_64 in the folder $INTEL_LLVM_INSTALL_PATH/lib/clang/15.0.0/lib/linux which doesn't exist. Only the include folder $INTEL_LLVM_INSTALL_PATH/lib/clang/15.0.0/include exists. I am building the intel-llvm compiler as described here

[EDIT] This issue probably belongs in the intel-llvm repo. There is a related issue that was just opened.

@ajaypanyala @TejaX-Alaghari Thanks for the detailed explanation. I have seen the same issue earlier. I am leaning towards waiting for LLVM issue to be fixed, so we don't have to recommend users to use temporary hack. What do you think?

mmeterel avatar May 13 '22 15:05 mmeterel

@TejaX-Alaghari @mmeterel I get a link error in the final step.

[100%] Linking CXX shared library ../../../../lib/libonemkl_blas_rocblas.so
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: cannot find -lclang_rt.builtins-x86_64

Not sure if something needs to be fixed in this file. For now, I got it to work by manually adding -L$ROCM_PATH/llvm/lib/clang/13.0.0/lib/linux to the link line where libclang_rt.builtins-x86_64.a exists. The link step tries to find clang_rt.builtins-x86_64 in the folder $INTEL_LLVM_INSTALL_PATH/lib/clang/15.0.0/lib/linux which doesn't exist. Only the include folder $INTEL_LLVM_INSTALL_PATH/lib/clang/15.0.0/include exists. I am building the intel-llvm compiler as described here [EDIT] This issue probably belongs in the intel-llvm repo. There is a related issue that was just opened.

@ajaypanyala @TejaX-Alaghari Thanks for the detailed explanation. I have seen the same issue earlier. I am leaning towards waiting for LLVM issue to be fixed, so we don't have to recommend users to use temporary hack. What do you think?

@mmeterel Sounds good. That would be perfect! Thanks!

ajaypanyala avatar May 13 '22 18:05 ajaypanyala

@TejaX-Alaghari @mmeterel I get a link error in the final step.

[100%] Linking CXX shared library ../../../../lib/libonemkl_blas_rocblas.so
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: cannot find -lclang_rt.builtins-x86_64

Not sure if something needs to be fixed in this file. For now, I got it to work by manually adding -L$ROCM_PATH/llvm/lib/clang/13.0.0/lib/linux to the link line where libclang_rt.builtins-x86_64.a exists. The link step tries to find clang_rt.builtins-x86_64 in the folder $INTEL_LLVM_INSTALL_PATH/lib/clang/15.0.0/lib/linux which doesn't exist. Only the include folder $INTEL_LLVM_INSTALL_PATH/lib/clang/15.0.0/include exists. I am building the intel-llvm compiler as described here [EDIT] This issue probably belongs in the intel-llvm repo. There is a related issue that was just opened.

@ajaypanyala @TejaX-Alaghari Thanks for the detailed explanation. I have seen the same issue earlier. I am leaning towards waiting for LLVM issue to be fixed, so we don't have to recommend users to use temporary hack. What do you think?

@mmeterel : According to the compiler team, SYCL compiler doesn't use clang-builtins library. It uses libgcc_s.so instead. In this case, the dependency for libclang_rt is from ROCM blas library. The clang version of libclang_rt should match the ROCM library built version. It suggests that adding -L$ROCM_PATH/llvm/lib/clang/13.0.0/lib/linux is the right permanent solution.

pengtu avatar Aug 11 '22 18:08 pengtu

@TejaX-Alaghari Sorry for the delay in getting back to this. I have a few follow-up requests and questions about this PR.

  1. Could you please update each conversation with your feedback?
  2. I left 7 comments on May 11. Could you please respond to each of these comments with the status? Let's do our best to have every question/comment answered.

mmeterel avatar Sep 18 '22 20:09 mmeterel

@TejaX-Alaghari I tested this PR on an AMD system with ROCM version 5.0.0 and card version gfx90a. LLVM with HIP backend is built from this commit: 66361038b63caaae566fc9648f5da50b74222b83

I am getting the warning/errors/crashes as shown in the attached files. I have tested only Level1 APIs for simplicity.

Could you please advise how to resolve them? Feel free to schedule a meeting to discuss in detail.

amd_llvm_level1.txt amd_llvm_level1_verbose.txt

mmeterel avatar Sep 18 '22 20:09 mmeterel

Thanks for the PR. Looking forward to add this feature.

I added some initial feedback. General comments:

  1. Please check if the comments added for cuBLAS backend still makes sense here as well. And if yes, update them accordingly.
  2. Please use sycl:: namespace throughout the project.
  3. Please make sure header guards match header names.
  4. Please test rocBLAS backend with hipSYCL so make sure nothing is broken. Update logs.
  5. I have some concerns about Copyrights - will add my feedback later ==> feedback added
  6. Please update README and docs folder as needed.
  7. In the test logs, please test MKLCPU backend as well. Also, I see segfault errors. Can you fix them?

Thanks for the detailed feedback on the PR. Please find below, the response for above comments:

  1. The existing comments are relevant for the changes introduced in this PR
  2. All the earlier "cl::sycl::"namespaces have been updated to "sycl::"
  3. All header guards match their respective file names
  4. TBD
  5. Updated the copyrights per the suggestion for newly introduced files
  6. Updated README and docs with the build configuration for rocBLAS backend through clang++
  7. We haven't observed any issues while testing MKLCPU backend for BLAS with "ctest"

TejaX-Alaghari avatar Sep 21 '22 13:09 TejaX-Alaghari

Tested the PR again after latest commits. Still seeing major failures. Please reach out for a debug session.

mmeterel avatar Sep 21 '22 18:09 mmeterel

Tested the PR again after latest commits. Still seeing major failures. Please reach out for a debug session.

After discussing with @TejaX-Alaghari offline, this issue is most likely due to missing synchronization. Similar issue was fixed in #228 for cuBLAS backend. @TejaX-Alaghari will work on updating this PR. Please note that, similar updates will likely needs to be done for rocRAND (#218 ) and rocSOLVER (#208 ) PRs.

mmeterel avatar Sep 21 '22 19:09 mmeterel

Attached are the results for my local tests using LLVM compiler: rocblas_llvm_log.txt

I see three problems with these results:

  1. SCAL seg fault: Tracked in issue #233
  2. Warnings due to hipCtxGetCurrent: Tracked in issue #234
  3. Warnings due to linking module: Tracked in https://github.com/intel/llvm/issues/6922

I also ran this PR using hipSYCL. Attached are the results: rocblas_hipsycl_log.txt The errors I see are intermittent and do not appear then we run gtest directly. I don't think, these are related to this PR.

Based on the above discussion, I don't have concerns with merging this PR. The mentioned issues can be tracked separately.

Thanks for all your efforts on enabling rocBLAS backend with LLVM-HIP compiler.

mmeterel avatar Sep 30 '22 18:09 mmeterel

@vmalia @mkrainiuk I just added my test results along with related issues. The PR is ok to merge IMHO. Could you please take a look from your side?

mmeterel avatar Sep 30 '22 18:09 mmeterel