Clang transpiler integration

Open vyast-softserveinc opened this issue 1 year ago • 30 comments

Description

This pull request integrates the occa-transpiler library to provide full C++ support in OCCA.

Added:

a transpiler-version option for switching between the old and new transpilers

vyast-softserveinc avatar May 13 '24 07:05 vyast-softserveinc

cmake -DOCCA_CLANG_BASED_TRANSPILER=ON worked for me to get the new transpiler source and generate the build.

deukhyun-cha avatar May 22 '24 14:05 deukhyun-cha

Hi - Any hope that this gets merged?

amikstcyr avatar Jul 04 '24 11:07 amikstcyr

Please take a look at this issue.

kris-rowe avatar Jul 26 '24 17:07 kris-rowe

@kris-rowe the issue is addressed, please take a look and try the fix.

YuraCobain avatar Aug 02 '24 10:08 YuraCobain

@kris-rowe all issues were addressed, can we please have a conclusion on this?

amikstcyr avatar Aug 27 '24 09:08 amikstcyr

Hi @kris-rowe, if you have any additional comments, questions, or concerns, I am glad to resolve them so the PR can be merged.

IuriiKobein avatar Sep 20 '24 08:09 IuriiKobein

Hi @IuriiKobein, I am planning to test this branch soon. I will let you know if I run into any issues.

thilinarmtb avatar Sep 20 '24 13:09 thilinarmtb

I started testing this branch on Frontier at OLCF and I am running into a segmentation fault when I run the 31_oklt_v3_moving_avg test.

I had to make the following changes in occa-transpiler since CMake 3.26 was not available on Frontier (I hope this is not the reason for the segfault).

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 2d9cc30..659d44f 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 3.26)
+cmake_minimum_required(VERSION 3.23)
 
 project(occa-transpiler VERSION 0.0.1 LANGUAGES C CXX)
 
diff --git a/lib/CMakeLists.txt b/lib/CMakeLists.txt
index 182f1e0..dd8b545 100644
--- a/lib/CMakeLists.txt
+++ b/lib/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 3.26)
+cmake_minimum_required(VERSION 3.23)
 project (occa-transpiler VERSION 0.0.1 LANGUAGES CXX)
 
 set(CMAKE_CXX_STANDARD 17)
diff --git a/tool/CMakeLists.txt b/tool/CMakeLists.txt
index 543d898..98cdb5c 100644
--- a/tool/CMakeLists.txt
+++ b/tool/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 3.26)
+cmake_minimum_required(VERSION 3.23)
 project (occa-tool VERSION 0.0.1 LANGUAGES CXX)

Then I followed the build instructions and everything built fine. When I tried to run the test, I get the following:

[[email protected] 31_oklt_v3_moving_avg]$ export OKLT_LOG_LEVEL=trace
[[email protected] 31_oklt_v3_moving_avg]$ ./examples_cpp_oklt_v3_moving_avg
[11:50:40.179] [I] start: OKL_DIRECTIVE_EXPANSION_STAGE [stage_action_runner.cpp:32]
[11:50:40.179] [T] input source:

#include "constants.h"

template<class T,
         int THREADS,
         int WINDOW>
struct MovingAverage {
    MovingAverage(int inputSize,
                  int outputSize,
                  T *shared_input,
                  T *shared_output)
        :_inputSize(inputSize)
        ,_outputSize(outputSize)
        ,_shared_data(shared_input)
        ,_result_data(shared_output)
    {}

    void syncCopyFrom(const T *input, int block_idx, int thread_idx) {
        int linearIdx = block_idx * THREADS + thread_idx;
        //INFO: copy base chunk
        if(linearIdx < _inputSize) {
            _shared_data[thread_idx] = input[linearIdx];
        }
        //INFO: copy WINDOW chunk
        int tailIdx = (block_idx + 1) * THREADS + thread_idx;
        if(tailIdx < _inputSize && thread_idx < WINDOW) {
            _shared_data[THREADS + thread_idx] = input[tailIdx];
        }
        @barrier;
    }

    void process(int thread_idx) {
        T sum = T();
        for(int i = 0; i < WINDOW; ++i) {
            sum += _shared_data[thread_idx + i];
        }
        _result_data[thread_idx] = sum / WINDOW;
        @barrier;
    }

    void syncCopyTo(T *output, int block_idx, int thread_idx) { 
        int linearIdx = block_idx * THREADS + thread_idx;
        if(linearIdx < _outputSize) {
            output[linearIdx] = _result_data[thread_idx];
        }
        @barrier;
    }
private:
    int _inputSize;
    int _outputSize;

    //INFO: not supported
    // @shared T _data[THREADS_PER_BLOCK + WINDOW_SIZE];
    // @shared T _result[THREADS_PER_BLOCK];

    T *_shared_data;
    T *_result_data;
};

@kernel void movingAverage32f(@restrict const float *inputData, 
                              int inputSize,
                              @restrict float *outputData,
                              int outputSize)
{
    @outer(0) for (int block_idx = 0; block_idx < outputSize / THREADS_PER_BLOCK + 1; ++block_idx) {
        @shared float blockInput[THREADS_PER_BLOCK + WINDOW_SIZE];
        @shared float blockResult[THREADS_PER_BLOCK];
        MovingAverage<float, THREADS_PER_BLOCK, WINDOW_SIZE> ma{
                inputSize,
                outputSize,
                blockInput,
                blockResult
        };
        @inner(0) for(int thread_idx = 0; thread_idx < THREADS_PER_BLOCK; ++thread_idx) {
            ma.syncCopyFrom(inputData, block_idx, thread_idx);
            ma.process(thread_idx);
            ma.syncCopyTo(outputData, block_idx, thread_idx);
        }
    }
}

 [stage_action_runner.cpp:33]
Segmentation fault

This is the backtrace I get with gdb:

#0  0x00007fffe78bf121 in llvm::vfs::InMemoryFileSystem::addFile(llvm::Twine const&, long, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> >, std::optional<unsigned int>, std::optional<unsigned int>, std::optional<llvm::sys::fs::file_type>, std::optional<llvm::sys::fs::perms>) () from /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/build/lib/libocca-transpiler.so.17
#1  0x00007fffe77df5aa in oklt::addInstrinsicStub (session=..., compiler=...) at /ccs/home/thilina/fus166/.local/occa-transpiler/clang/include/llvm/ADT/Twine.h:285
#2  0x00007fffe782dd43 in oklt::StageAction::PrepareToExecuteAction (this=0x4c1a00, compiler=...) at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#3  0x00007fffe973b398 in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) () from /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/build/lib/libocca-transpiler.so.17
#4  0x00007fffe7949f0e in clang::tooling::FrontendActionFactory::runInvocation(std::shared_ptr<clang::CompilerInvocation>, clang::FileManager*, std::shared_ptr<clang::PCHContainerOperations>, clang::DiagnosticConsumer*) ()
   from /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/build/lib/libocca-transpiler.so.17
#5  0x00007fffe79415ac in clang::tooling::ToolInvocation::runInvocation(char const*, clang::driver::Compilation*, std::shared_ptr<clang::CompilerInvocation>, std::shared_ptr<clang::PCHContainerOperations>) ()
   from /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/build/lib/libocca-transpiler.so.17
#6  0x00007fffe79452d8 in clang::tooling::ToolInvocation::run() () from /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/build/lib/libocca-transpiler.so.17
#7  0x00007fffe7949456 in clang::tooling::runToolOnCodeWithArgs(std::unique_ptr<clang::FrontendAction, std::default_delete<clang::FrontendAction> >, llvm::Twine const&, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, llvm::Twine const&, llvm::Twine const&, std::shared_ptr<clang::PCHContainerOperations>) ()
   from /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/build/lib/libocca-transpiler.so.17
#8  0x00007fffe794993d in clang::tooling::runToolOnCodeWithArgs(std::unique_ptr<clang::FrontendAction, std::default_delete<clang::FrontendAction> >, llvm::Twine const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, llvm::Twine const&, llvm::Twine const&, std::shared_ptr<clang::PCHContainerOperations>, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&) () from /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/build/lib/libocca-transpiler.so.17
#9  0x00007fffe783266d in oklt::runStageAction (stageName=..., session=...) at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/deps/occa-transpiler/lib/pipeline/core/stage_action_runner.cpp:68
#10 0x00007fffe7833144 in oklt::runPipeline (pipeline=..., session=...) at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/deps/occa-transpiler/lib/pipeline/core/stage_action_runner.cpp:97
#11 0x00007fffe7829a21 in oklt::normalizeAndTranspile (input=...) at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/deps/occa-transpiler/lib/pipeline/normalizer_and_transpiler.cpp:16
#12 0x00007fffed8eaad4 in occa::transpiler::Transpiler::run (this=this@entry=0x7fffffff5140, filename=..., mode=..., kernelProps=...) at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/src/occa/internal/utils/transpiler_utils.cpp:135
#13 0x00007fffed8b3ee4 in occa::serial::v3::transpileFile (filename=..., outputFile=..., kernelProps=..., metadata=..., mode=...) at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/src/occa/internal/modes/serial/device.cpp:69
#14 0x00007fffed8b7147 in occa::serial::device::buildKernel (this=this@entry=0x478460, filename=..., kernelName=..., kernelHash=..., kernelProps=..., isLauncherKernel=<optimized out>, isLauncherKernel@entry=false)
    at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/src/occa/internal/modes/serial/device.cpp:353
#15 0x00007fffed8b778d in occa::serial::device::buildKernel (this=this@entry=0x478460, filename=..., kernelName=..., kernelHash=..., kernelProps=...)
    at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/src/occa/internal/modes/serial/device.cpp:168
#16 0x00007fffed7010c6 in occa::device::buildKernel (this=this@entry=0x7fffffff5a40, filename=..., kernelName=..., props=...) at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/src/core/device.cpp:394
#17 0x0000000000401ea2 in main (argc=<optimized out>, argv=<optimized out>) at /ccs/home/thilina/fus166/Workspace/anl/occa-transpiler/occa/examples/cpp/31_oklt_v3_moving_avg/main.cpp:67

I am using gcc=12.3.0 to build OCCA and doing a release build. I will try a debug build and see if it gives me more information.

PS: I was actually doing a release build with debug info.

thilinarmtb avatar Sep 23 '24 15:09 thilinarmtb

Thanks for the report. I have a quick question that will help us proceed with a potential fix.

Was clang installed according to the https://github.com/libocca/occa-transpiler?tab=readme-ov-file#setup-clang-17 section? If so, which variant exactly was used?

IuriiKobein avatar Sep 23 '24 16:09 IuriiKobein

I installed clang from source, checking out the llvmorg-17.0.6 tag. Below is the commit:

commit 6009708b4367171ccdbf4b5905cb6a803753fe18 (grafted, HEAD, tag: llvmorg-17.0.6)
Author: Tobias Hieta <[email protected]>
Date:   Tue Nov 28 09:52:28 2023 +0100

    Revert "[runtimes] Add missing test dependencies to check-all (#72955)"
    
    This reverts commit e957e6dcb29d94e4e1678da9829b77009be88926.
    
    The commit was reverted on main because of issues. We will not carry
    this in the release branch for 17.x

These are the configure and build commands I used:

cmake -S llvm -B build -G "Unix Makefiles" \
  -DCMAKE_C_COMPILER=`which gcc` \
  -DCMAKE_CXX_COMPILER=`which g++` \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=~/fus166/.local/occa-transpiler/clang \
  -DLLVM_ENABLE_WERROR=OFF \
  -DLLVM_TARGETS_TO_BUILD='X86' \
  -DLLVM_PARALLEL_LINK_JOBS=1 \
  -DLLVM_ENABLE_RTTI=ON \
  -DCMAKE_POLICY_DEFAULT_CMP0094=NEW \
  -DCMAKE_VERBOSE_MAKEFILE=ON \
  -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF \
  -DLLVM_ENABLE_PROJECTS="polly;lld;lldb;clang-tools-extra;llvm;clang" \
  -DLLVM_ENABLE_RUNTIMES="libunwind;libcxx;libcxxabi;compiler-rt" \
  -DLLVM_REQUIRES_RTTI=ON \
  -DLLVM_ENABLE_RTTI=ON \
  -DLLVM_ENABLE_EH=ON \
  -DLLVM_POLLY_LINK_INTO_TOOLS=ON \
  -DLLVM_Z3_INSTALL_DIR=${Z3_INSTALL_DIR} \
  -DLLVM_ENABLE_Z3_SOLVER=OFF

make -C build install -j12

I think the only difference from the configure command in the instructions is that I turned off the Z3 solver.

thilinarmtb avatar Sep 23 '24 16:09 thilinarmtb

So far we have not been able to reproduce the issue on our local machines with the existing configuration. Next, we will use the same CMake version and clang build options as yours to try to catch the issue.

YuraCobain avatar Sep 23 '24 17:09 YuraCobain

Seems like the reason for the segfault was that I used two different versions of gcc: one version to build clang and another version to build occa.

Once I used the same gcc version for both, I don't see a segfault anymore. Now I can run the test but it still fails:

[[email protected] 31_oklt_v3_moving_avg]$ ./examples_cpp_oklt_v3_moving_avg 
Comparison with gold values has failed

I can attach the full log with trace on if that is helpful.
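
For anyone reproducing this comparison, a host-side reference can be sketched as follows. It assumes, based on the kernel shown in the trace above, that each output element is the mean of WINDOW consecutive inputs starting at the same index. The function name movingAverageReference is hypothetical, and the way the example's main.cpp actually computes its gold values is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not taken from the OCCA example):
// output[j] = (input[j] + ... + input[j + window - 1]) / window
std::vector<float> movingAverageReference(const std::vector<float>& input,
                                          std::size_t window) {
    assert(window > 0);
    if (input.size() < window) {
        return {};  // not enough samples for a single full window
    }
    std::vector<float> output(input.size() - window + 1);
    for (std::size_t j = 0; j < output.size(); ++j) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < window; ++i) {
            sum += input[j + i];
        }
        output[j] = sum / static_cast<float>(window);
    }
    return output;
}
```

When comparing device output against a reference like this, a small tolerance is usually safer than exact floating-point equality.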

thilinarmtb avatar Sep 23 '24 19:09 thilinarmtb

Glad the root cause of the segfault was found. The example has only been tested with the CUDA and HIP backends. You can verify it with the following option:

examples_cpp_oklt_v3_moving_avg -d "{mode: 'CUDA', device_id: 0}"

We are working on a fix for Serial mode as well, which is the default when the -d option is omitted.

YuraCobain avatar Sep 23 '24 19:09 YuraCobain

Thanks! Yes, the example passes with the HIP backend. I will try to test this on a few more kernels.

thilinarmtb avatar Sep 23 '24 20:09 thilinarmtb

Hi Thilina,

The example "31_oklt_v3_moving_avg" has been fixed to support host-only backends (Serial, OpenMP). Please pull the latest changes and try the fix. Looking forward to your feedback.

IuriiKobein avatar Sep 23 '24 21:09 IuriiKobein

With your latest fix, the tests pass for HIP, Serial and OpenMP backends. I will test this a bit more.

thilinarmtb avatar Sep 24 '24 14:09 thilinarmtb

@IuriiKobein : I added a simple kernel which calculates the dot product between two vectors here. It seems to fail with the transpiler. The failure is due to the transpiler not recognizing unsigned int. I think OCCA supports unsigned int (I may be wrong).

thilinarmtb avatar Sep 24 '24 19:09 thilinarmtb

@thilinarmtb please refer to the issue reported above for clarification.

YuraCobain avatar Sep 24 '24 20:09 YuraCobain

Is transpiler version 2 the same as regular OCCA? Seems like unsigned

@thilinarmtb please refer to the issue reported above for clarification.

We will continue the discussion there till the issue is resolved.

thilinarmtb avatar Sep 24 '24 20:09 thilinarmtb

I am fine with merging this. I can add a few more tests after the PR gets merged. Below are a few minor comments I have.

We should probably figure out the minimum versions of CMake and gcc required and add those versions to CMakeLists.txt and the documentation. For example, I don't think we need a CMake version as new as the one currently used in this PR.

Also, it is worth mentioning that you have to build clang and the occa-transpiler using the same compiler. I don't know if this is an actual requirement. But I had to do so in order to run it on my testing machine.
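
One way to make such a mismatch visible is to have each build report the compiler that produced it. The sketch below is only an illustration: the helper name compilerId is hypothetical and is not part of OCCA or occa-transpiler. It relies only on the version macros that GCC, Clang, and libstdc++ predefine.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: describes the compiler (and libstdc++ build date,
// if present) used to compile this translation unit.
std::string compilerId() {
    std::ostringstream os;
#if defined(__clang__)
    // Checked first because clang also defines __GNUC__.
    os << "clang " << __clang_major__ << '.' << __clang_minor__ << '.'
       << __clang_patchlevel__;
#elif defined(__GNUC__)
    os << "gcc " << __GNUC__ << '.' << __GNUC_MINOR__ << '.'
       << __GNUC_PATCHLEVEL__;
#else
    os << "unknown compiler";
#endif
#if defined(__GLIBCXX__)
    // __GLIBCXX__ is a date stamp defined by libstdc++ headers.
    os << ", libstdc++ " << __GLIBCXX__;
#endif
    return os.str();
}
```

Logging this string from code compiled in each build tree and comparing the results would show whether two different gcc versions were involved.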

thilinarmtb avatar Sep 30 '24 14:09 thilinarmtb

Hi @thilinarmtb, I have lowered the CMake version to match the one OCCA requires. I also added notes on the minimum GCC version and on using the same compiler version to build clang and the transpiler itself.

IuriiKobein avatar Oct 04 '24 16:10 IuriiKobein

@IuriiKobein : Thank you very much for the changes. I will merge the PR once the tests pass.

thilinarmtb avatar Oct 07 '24 14:10 thilinarmtb

Hi @thilinarmtb @kris-rowe Do you have any updates regarding this PR? Thanks

IuriiKobein avatar Oct 11 '24 09:10 IuriiKobein

I realized that the occa-transpiler is not tested in GitHub CI. I am trying to test it here. I am running into a bunch of errors (which I think are due to some header file conflict). I don't run into this issue locally.

thilinarmtb avatar Oct 15 '24 19:10 thilinarmtb

@thilinarmtb could you please test default compiler flags from OCCA cmake on CI?

-- C flags : -Wall -Wextra -Wunused-function -Wunused-variable -Wwrite-strings -Wfloat-equal -Wcast-align -Wlogical-op -Wshadow -Wno-c++11-long-long -O3 -DNDEBUG
-- CXX flags : -Wall -Wextra -Wunused-function -Wunused-variable -Wwrite-strings -Wfloat-equal -Wcast-align -Wlogical-op -Wshadow -Wno-unused-parameter -fno-strict-aliasing -O3 -DNDEBUG

IuriiKobein avatar Oct 15 '24 19:10 IuriiKobein

@IuriiKobein : It worked. But it takes about 18 minutes to build OCCA with transpiler enabled. One option is to package transpiler as a conda package, install it on GitHub CI runners (this will be fast) and then build OCCA by linking with the transpiler library. I can package the transpiler to a conda package if you are interested in taking this route.

thilinarmtb avatar Oct 16 '24 20:10 thilinarmtb

I would appreciate it if you could take this approach. We plan to speed up the transpiler's compilation time after the initial merge into OCCA.

BTW, on an average machine with 16 logical i7 cores it takes about 3 minutes, so it is a bit of a surprise that it is several times slower on CI.

IuriiKobein avatar Oct 16 '24 20:10 IuriiKobein

My build using 16 parallel processes failed in GitHub CI. I tried both 8 and 4 processes and it worked but took about 18 minutes. See here.

I will open a few minor issues on occa-transpiler repo in order to fix a few things before going ahead with a conda package.

thilinarmtb avatar Oct 16 '24 20:10 thilinarmtb

The issues are fixed and the PR is updated.

IuriiKobein avatar Oct 16 '24 22:10 IuriiKobein

Thanks @IuriiKobein. I will go ahead with the conda package.

thilinarmtb avatar Oct 16 '24 23:10 thilinarmtb