Clang segfault with short array of very large elements
I get the following error when compiling against CUDA
0. Program arguments: /usr/lib/llvm-11/bin/clang -cc1 -triple nvptx64-nvidia-cuda -aux-triple x86_64-pc-linux-gnu -S -disable-free -disable-llvm-verifier -discard-value-names -main-file-name PndFtsTrLineZOXY_SYCL.cpp -mrelocation-model static -mframe-pointer=all -fno-rounding-math -fno-verbose-asm -no-integrated-as -aux-target-cpu x86-64 -fcuda-is-device -mlink-builtin-bitcode /usr/local/cuda-10.2/nvvm/libdevice/libdevice.10.bc -target-feature +ptx65 -target-sdk-version=10.2 -target-cpu sm_30 -fno-split-dwarf-inlining -debug-info-kind=limited -dwarf-version=2 -debugger-tuning=gdb -resource-dir /usr/lib/llvm-11/lib/clang/11.0.1 -internal-isystem /usr/lib/llvm-11/lib/clang/11.0.1/include/cuda_wrappers -internal-isystem /usr/local/cuda-10.2/include -include __clang_cuda_runtime_wrapper.h -D HIPSYCL_CLANG -I /usr/local/bin/../include/ -I /usr/local/bin/../include/hipSYCL/contrib -D FaisPndFtsTrLibSYCL_EXPORTS -D HIPSYCL_DEBUG_LEVEL=1 -D PNDFTSTR_DEBUG -D PNDFTSTR_GROUPED_HIT_SET_CAPACITY=512 -D PNDFTSTR_LINE_SUBTRACK_CAPACITY=32 -D PNDFTSTR_MAX_TRACKS_PER_EVENT=128 -D PNDFTSTR_OUTERMOST_HIT_PAIRS_CAPACITY=2048 -I /home/bsobol/work/bs_playground/basic_utility -I /home/bsobol/work/bs_playground/fts_track_recon -I /home/bsobol/work/bs_playground/fts_benchmark_utility -I /usr/local/include/hipSYCL/contrib -D NDEBUG -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/x86_64-linux-gnu/c++/8 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/x86_64-linux-gnu/c++/8 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/backward -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/x86_64-linux-gnu/c++/8 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/x86_64-linux-gnu/c++/8 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/backward -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-11/lib/clang/11.0.1/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-11/lib/clang/11.0.1/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O0 -Wall -Wextra -Wno-unknown-pragmas -pedantic -std=c++17 -fdeprecated-macro -fno-dwarf-directory-asm -fno-autolink -fdebug-compilation-dir /home/bsobol/work/bs_playground/cmake-build-release-2/fts_track_recon/sycl -ferror-limit 19 -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -load /usr/local/bin/../lib/libhipSYCL_clang.so -o /tmp/PndFtsTrLineZOXY_SYCL-9e61ef.s -x cuda /home/bsobol/work/bs_playground/fts_track_recon/sycl/PndFtsTrLineZOXY_SYCL.cpp
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module 'fts_track_recon/sycl/PndFtsTrLineZOXY_SYCL.cpp'.
4. Running pass 'NVPTX DAG->DAG Pattern Instruction Selection' on function '@_ZN4fais26PndFtsTrLineZOX_ExecKernelIN7hipsycl4sycl8accessorINS_12PndFtsTrGeomELi1ELNS2_6access4modeE1024ELNS5_6targetE2014ELNS5_11placeholderE0EEENS3_INS_26PndFtsTrGroupedHitSetFImplILt512EEELi1ELS6_1024ELS7_2014ELS8_0EEENS3_ISt5arrayINS_25PndFtsTrLineSubtrackFImplILt32EEELm128EELi1ELS6_1026ELS7_2014ELS8_0EEENS3_ItLi1ELS6_1026ELS7_2014ELS8_0EEEEclENS2_2idILi1EEE'
#0 0x00007fd74c80d74f llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xa9974f)
#1 0x00007fd74c80bac0 llvm::sys::RunSignalHandlers() (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xa97ac0)
#2 0x00007fd74c80dc25 (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xa99c25)
#3 0x00007fd75393b980 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980)
#4 0x00007fd74ce636b4 llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode*, llvm::SDValue const*) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0x10ef6b4)
#5 0x00007fd74ccb9261 (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xf45261)
#6 0x00007fd74ccb59e2 (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xf419e2)
#7 0x00007fd74ccb4138 llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xf40138)
#8 0x00007fd74ce76812 llvm::SelectionDAGISel::CodeGenAndEmitDAG() (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0x1102812)
#9 0x00007fd74ce75fe2 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0x1101fe2)
#10 0x00007fd74ce73ce1 llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0x10ffce1)
#11 0x00007fd74cae1cae llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xd6dcae)
#12 0x00007fd74c91d639 llvm::FPPassManager::runOnFunction(llvm::Function&) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xba9639)
#13 0x00007fd74c922ca3 llvm::FPPassManager::runOnModule(llvm::Module&) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xbaeca3)
#14 0x00007fd74c91dc80 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/usr/lib/x86_64-linux-gnu/libLLVM-11.so.1+0xba9c80)
#15 0x00007fd7522f6453 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) (/usr/lib/x86_64-linux-gnu/libclang-cpp.so.11+0x15bf453)
#16 0x00007fd7525b5d46 (/usr/lib/x86_64-linux-gnu/libclang-cpp.so.11+0x187ed46)
#17 0x00007fd752c7befc clang::MultiplexConsumer::HandleTranslationUnit(clang::ASTContext&) (/usr/lib/x86_64-linux-gnu/libclang-cpp.so.11+0x1f44efc)
#18 0x00007fd751641f13 clang::ParseAST(clang::Sema&, bool, bool) (/usr/lib/x86_64-linux-gnu/libclang-cpp.so.11+0x90af13)
#19 0x00007fd752c47ae8 clang::FrontendAction::Execute() (/usr/lib/x86_64-linux-gnu/libclang-cpp.so.11+0x1f10ae8)
#20 0x00007fd752bfef31 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/usr/lib/x86_64-linux-gnu/libclang-cpp.so.11+0x1ec7f31)
#21 0x00007fd752cacc00 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/usr/lib/x86_64-linux-gnu/libclang-cpp.so.11+0x1f75c00)
#22 0x00000000004125d7 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/usr/lib/llvm-11/bin/clang+0x4125d7)
#23 0x0000000000410a5e (/usr/lib/llvm-11/bin/clang+0x410a5e)
#24 0x000000000041086a main (/usr/lib/llvm-11/bin/clang+0x41086a)
#25 0x00007fd74b065bf7 __libc_start_main /build/glibc-S7xCS9/glibc-2.27/csu/../csu/libc-start.c:344:0
#26 0x000000000040dd1a _start (/usr/lib/llvm-11/bin/clang+0x40dd1a)
clang: error: unable to execute command: Segmentation fault (core dumped)
This happens on both stable and develop branches with hipSYCL compiled with gcc-8 (also tried to compile with clang-11, same result).
System config: Ubuntu 18.04 gcc-8.4 clang-11 CUDA 10.2 sm_30 arch GPU
I believe it is similar issue as in #356
I cannot reproduce on Ubuntu 20.04, clang 11 (from apt.llvm.org), CUDA 10.1 when compiling for sm_30. Where did you get your clang from? Does it happen with every input code?
Clang is from apt.llvm.org. Hello world kernel compiles fine.
As I can deduce from line 4 it crashes during compilation of my main kernel, which is quite complex and large (but doesn't use any advanced or fancy SYCL features).
I will try to investigate on what exactly it crashes.
May it be because I have slightly higher CUDA version (10.2)?
Okay, that's a good sign if a hello world program compiles.
Since hipSYCL doesn't really touch code generation, it's likely that crashes in the llvm backend like this are caused by llvm bugs. Can you try with clang 10 or clang 12?
The latest CUDA version that clang officially supports is 10.1 (it just treats all newer versions as if they were 10.1). So this might be a bit fragile, but this would still be the first time that I have heard that this has caused an issue if it is because of the CUDA version.
Yes, it happened also on clang 10.
I managed to figure out what part of my code caused the issue. I had a short buffer of <10 objects containing long arrays of ints (~13k). It was a design remain from 'don't change anything you don't have to' approach when I did initial porting to SYCL and while it is probably not a good memory layout, do uou have any idea why it causes such a crash?
After 'flipping it over' to long buffer of short arrays, program compiles succesfully (at least with optimisations disabled, with optimisations it takes forever)
Unfortunately now I get
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_event.cpp:66 @ wait(): cuda_node_event: cudaEventSynchronize() failed (error code = CUDA:702)
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_event.cpp:66 @ wait(): cuda_node_event: cudaEventSynchronize() failed (error code = CUDA:702)
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_queue.cpp:217 @ submit_memcpy(): cuda_queue: Couldn't submit memcpy (error code = CUDA:702)
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_allocator.cpp:92 @ free(): cuda_allocator: cudaFree() failed (error code = CUDA:702)
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_event.cpp:45 @ ~cuda_node_event(): cuda_node_event: Couldn't destroy event (error code = CUDA:702)
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_event.cpp:45 @ ~cuda_node_event(): cuda_node_event: Couldn't destroy event (error code = CUDA:702)
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_allocator.cpp:92 @ free(): cuda_allocator: cudaFree() failed (error code = CUDA:702)
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_allocator.cpp:92 @ free(): cuda_allocator: cudaFree() failed (error code = CUDA:702)
terminate called after throwing an instance of 'hipsycl::sycl::runtime_error'
[hipSYCL Error] Runtime has registered error: from /home/bsobol/work/git_repos/hipSYCL/src/runtime/cuda/cuda_allocator.cpp:92 @ free(): cuda_allocator: cudaFree() failed (error code = CUDA:702)
which I believe is somehow related to #314. queue.wait() doesn't help here.
I managed to figure out what part of my code caused the issue. I had a short buffer of <10 objects containing long arrays of ints (~13k). It was a design remain from 'don't change anything you don't have to' approach when I did initial porting to SYCL and while it is probably not a good memory layout, do uou have any idea why it causes such a crash?
My guess would be that you might be hitting some internal limitation of the llvm nvptx backend. I would suggest to try reproducing with pure clang-compiled CUDA, and if it reproduces, open an LLVM bug report.
which I believe is somehow related to #314. queue.wait() doesn't help here.
It is highly unlikely that this is related. #314 is errors appearing in destructors, whereas here it seems that after a certain point, each CUDA call seems to fail. Probably this is because CUDA errors are sticky, i.e. after one call has failed all subsequent calls will fail as well (we don't clear CUDA error state yet, but we should). CUDA error 702 seems to be cudaErrorLaunchTimeout. Maybe your kernel is running for too long or you enqueue too many kernels? Maybe cuda-memcheck can also reveal more information.
Maybe your kernel is running for too long or you enqueue too many kernels? Maybe cuda-memcheck can also reveal more information.
That was it, I've completely missed that this GPU is responsible for Xserver. sudo init 3 did the trick.
Now I can finally go back to coding... and maybe resolving issues with my gfx803.
Many thanks for your insights.
My guess would be that you might be hitting some internal limitation of the llvm nvptx backend. I would suggest to try reproducing with pure clang-compiled CUDA, and if it reproduces, open an LLVM bug report.
I will try, and when I have some results, I will also post it here and close the issue.