
WAVM errors in cluster (Failure value returned from cantFail wrapped call)

Open mmathys opened this issue 3 years ago • 2 comments

Setup: 6 nodes, LAMMPS experiment. The error is intermittent: it occurs in roughly 30% of runs.

faasm-dev-worker-10  | Failure value returned from cantFail wrapped call
faasm-dev-worker-10  | section header table goes past the end of the file: e_shoff = 0x9d8300
faasm-dev-worker-10  | UNREACHABLE executed at /usr/lib/llvm-10/include/llvm/Support/Error.h:744!
More detailed output
faasm-dev-worker-8   | Caught stack backtrace:
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN7faabric4util11handleCrashEi+0x3b)[0x52188ab]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_Z12crashHandleri+0x13)[0x5218853]
faasm-dev-worker-8   | /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f9b91acb090]
faasm-dev-worker-8   | /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f9b91acb00b]
faasm-dev-worker-8   | /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f9b91aaa859]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner[0x4cdd0b1]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4llvm8cantFailISt10unique_ptrINS_6object10ObjectFileESt14default_deleteIS3_EEEET_NS_8ExpectedIS7_EEPKc+0x14b)[0x2d86a5b]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4WAVM7LLVMJIT6ModuleC2ERKSt6vectorIhSaIhEERKNS_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS_17DefaultHashPolicyISD_EEEEbOSD_+0x212)[0x2d82e22]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZSt12construct_atIN4WAVM7LLVMJIT6ModuleEJRKSt6vectorIhSaIhEERNS0_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS0_17DefaultHashPolicyISE_EEEEbSE_EEDTgsnwcvPvLi0E_T_pispclsr3stdE7declvalIT0_EEEEPSK_DpOSL_+0x71)[0x2d943f1]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZNSt16allocator_traitsISaIN4WAVM7LLVMJIT6ModuleEEE9constructIS2_JRKSt6vectorIhSaIhEERNS0_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS0_17DefaultHashPolicyISH_EEEEbSH_EEEvRS3_PT_DpOT0_+0x70)[0x2d941b0]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZNSt23_Sp_counted_ptr_inplaceIN4WAVM7LLVMJIT6ModuleESaIS2_ELN9__gnu_cxx12_Lock_policyE2EEC2IJRKSt6vectorIhSaIhEERNS0_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS0_17DefaultHashPolicyISJ_EEEEbSJ_EEES3_DpOT_+0xca)[0x2d93efa]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN4WAVM7LLVMJIT6ModuleESaIS6_EJRKSt6vectorIhSaIhEERNS4_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS4_17DefaultHashPolicyISJ_EEEEbSJ_EEERPT_St20_Sp_alloc_shared_tagIT0_EDpOT1_+0xf7)[0x2d93cc7]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZNSt12__shared_ptrIN4WAVM7LLVMJIT6ModuleELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS2_EJRKSt6vectorIhSaIhEERNS0_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS0_17DefaultHashPolicyISJ_EEEEbSJ_EEESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x90)[0x2d93bb0]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZNSt10shared_ptrIN4WAVM7LLVMJIT6ModuleEEC2ISaIS2_EJRKSt6vectorIhSaIhEERNS0_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS0_17DefaultHashPolicyISH_EEEEbSH_EEESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x7c)[0x2d93b0c]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZSt15allocate_sharedIN4WAVM7LLVMJIT6ModuleESaIS2_EJRKSt6vectorIhSaIhEERNS0_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS0_17DefaultHashPolicyISF_EEEEbSF_EESt10shared_ptrIT_ERKT0_DpOT1_+0x7f)[0x2d93a2f]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZSt11make_sharedIN4WAVM7LLVMJIT6ModuleEJRKSt6vectorIhSaIhEERNS0_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmNS0_17DefaultHashPolicyISE_EEEEbSE_EESt10shared_ptrIT_EDpOT0_+0x80)[0x2d88430]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4WAVM7LLVMJIT10loadModuleERKSt6vectorIhSaIhEEONS_7HashMapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_15FunctionBindingENS_17DefaultHashPolicyISC_EEEEOS1_INS_2IR12FunctionTypeESaISJ_EEOS1_ISD_SaISD_EEOS1_INS0_12TableBindingESaISQ_EEOS1_INS0_13MemoryBindingESaISU_EEOS1_INS0_13GlobalBindingESaISY_EEOS1_INS0_20ExceptionTypeBindingESaIS12_EENS0_15InstanceBindingEmRKS1_IPNS_7Runtime19FunctionMutableDataESaIS19_EEOSC_+0xa3d)[0x2d84a8d]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4WAVM7Runtime25instantiateModuleInternalEPNS0_11CompartmentERKSt10shared_ptrIKNS0_6ModuleEEOSt6vectorINS0_21FunctionImportBindingESaISA_EEOS9_IPNS0_5TableESaISF_EEOS9_IPNS0_6MemoryESaISK_EEOS9_IPNS0_6GlobalESaISP_EEOS9_IPNS0_13ExceptionTypeESaISU_EEONSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKS3_INS0_13ResourceQuotaEE+0x1879)[0x2d9fdc9]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4WAVM7Runtime17instantiateModuleEPNS0_11CompartmentERKSt10shared_ptrIKNS0_6ModuleEEOSt6vectorIPNS0_6ObjectESaISB_EEONSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKS3_INS0_13ResourceQuotaEE+0x864)[0x2d9e4f4]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4wasm14WAVMWasmModule20createModuleInstanceERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_+0x6b4)[0x2aa6ff4]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4wasm14WAVMWasmModule24doBindToFunctionInternalERN7faabric7MessageEbb+0x2b6)[0x2aa61f6]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4wasm14WAVMWasmModule16doBindToFunctionERN7faabric7MessageEb+0x31)[0x2aa5f31]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4wasm10WasmModule14bindToFunctionERN7faabric7MessageEb+0xeb)[0x2b72dab]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4wasm15WAVMModuleCache15getCachedModuleERN7faabric7MessageE+0x215)[0x2ad5745]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4wasm14WAVMWasmModule24doBindToFunctionInternalERN7faabric7MessageEbb+0x58)[0x2aa5f98]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4wasm14WAVMWasmModule16doBindToFunctionERN7faabric7MessageEb+0x31)[0x2aa5f31]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN4wasm10WasmModule14bindToFunctionERN7faabric7MessageEb+0xeb)[0x2b72dab]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZN7faaslet7FaasletC2ERN7faabric7MessageE+0x382)[0x2a84f72]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZSt12construct_atIN7faaslet7FaasletEJRN7faabric7MessageEEEDTgsnwcvPvLi0E_T_pispclsr3stdE7declvalIT0_EEEEPS6_DpOS7_+0x2d)[0x2aa34cd]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZNSt16allocator_traitsISaIN7faaslet7FaasletEEE9constructIS1_JRN7faabric7MessageEEEEvRS2_PT_DpOT0_+0x31)[0x2aa32d1]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZNSt23_Sp_counted_ptr_inplaceIN7faaslet7FaasletESaIS1_ELN9__gnu_cxx12_Lock_policyE2EEC2IJRN7faabric7MessageEEEES2_DpOT_+0x81)[0x2aa3051]
faasm-dev-worker-8   | /build/faasm/bin/pool_runner(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN7faaslet7FaasletESaIS5_EJRN7faabric7MessageEEEERPT_St20_Sp_alloc_shared_tagIT0_EDpOT1_+0x9d)[0x2aa2e6d]
faasm-dev-worker-8   | Assertion failed at /build/faasm/_deps/wavm_ext-src/Lib/Runtime/Compartment.cpp(43): !instances.size()
faasm-dev-worker-8   | Call stack:
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!WAVM::Runtime::Compartment::~Compartment()+426
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!WAVM::Runtime::Compartment::~Compartment()+24
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner+46550644
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!WAVM::Runtime::tryCollectCompartment(WAVM::Runtime::GCPointer<WAVM::Runtime::Compartment>&&)+46
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!wasm::WAVMWasmModule::doWAVMGarbageCollection()+432
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!wasm::WAVMWasmModule::~WAVMWasmModule()+42
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>::~pair()+28
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!void std::destroy_at<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule> >(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>*)+20
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!void std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>, true> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule> >(std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>, true> >&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>*)+24
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>, true> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>, true>*)+57
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>, true> > >::_M_deallocate_nodes(std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>, true>*)+68
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()+53
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()+24
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, wasm::WAVMWasmModule, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, wasm::WAVMWasmModule> > >::~unordered_map()+20
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!wasm::WAVMModuleCache::~WAVMModuleCache()+24
faasm-dev-worker-8   |   /lib/x86_64-linux-gnu/libc.so.6+288934
faasm-dev-worker-8   |   /lib/x86_64-linux-gnu/libc.so.6!exit+31
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!faabric::util::handleCrash(int)+259
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!crashHandler(int)+18
faasm-dev-worker-8   |   /lib/x86_64-linux-gnu/libc.so.6+274575
faasm-dev-worker-8   |   /lib/x86_64-linux-gnu/libc.so.6!gsignal+202
faasm-dev-worker-8   |   /lib/x86_64-linux-gnu/libc.so.6!abort+298
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!llvm::llvm_unreachable_internal(char const*, char const*, unsigned int)+448
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!std::unique_ptr<llvm::object::ObjectFile, std::default_delete<llvm::object::ObjectFile> > llvm::cantFail<std::unique_ptr<llvm::object::ObjectFile, std::default_delete<llvm::object::ObjectFile> > >(llvm::Expected<std::unique_ptr<llvm::object::ObjectFile, std::default_delete<llvm::object::ObjectFile> > >, char const*)+330
faasm-dev-worker-8   |   /build/faasm/bin/pool_runner!WAVM::LLVMJIT::Module::Module(std::vector<unsigned char, std::allocator<unsigned char> > const&, WAVM::HashMap<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, WAVM::DefaultHashPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&)+529
faasm-dev-worker-8   | munmap_chunk(): invalid pointer

Sometimes the failure is reported with a different message:

faasm-dev-worker-7   | Failure value returned from cantFail wrapped call
faasm-dev-worker-7   | The file was not recognized as a valid object file
faasm-dev-worker-7   | UNREACHABLE executed at /usr/lib/llvm-10/include/llvm/Support/Error.h:744!
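
Both messages suggest that the object code handed to LLVM was truncated or otherwise corrupted by the time it was loaded. As a rough diagnostic, a minimal sketch (not part of Faasm or WAVM) that applies similar sanity checks to a byte buffer could look like the following. It assumes a 64-bit little-endian ELF object, and the function name `check_elf64_object` is hypothetical. The bounds check is a simplification of what LLVM does, not a copy of it.

```python
import struct

ELF_MAGIC = b"\x7fELF"


def check_elf64_object(data: bytes) -> str:
    """Return "ok" or a short reason why the bytes do not look like a complete ELF64 object."""
    # The ELF64 header is 64 bytes long and starts with the 4-byte magic.
    if len(data) < 64 or data[:4] != ELF_MAGIC:
        return "not recognized as an ELF object"
    # e_ident[4] == 2 means 64-bit, e_ident[5] == 1 means little-endian.
    if data[4] != 2 or data[5] != 1:
        return "not a 64-bit little-endian ELF"
    # e_shoff lives at offset 0x28; e_shentsize and e_shnum at 0x3A and 0x3C.
    (e_shoff,) = struct.unpack_from("<Q", data, 0x28)
    e_shentsize, e_shnum = struct.unpack_from("<HH", data, 0x3A)
    # The whole section header table must fit inside the buffer.
    if e_shoff + e_shentsize * e_shnum > len(data):
        return (
            "section header table goes past the end of the file: "
            f"e_shoff = {e_shoff:#x}"
        )
    return "ok"
```

Running a check like this on the cached object bytes of each worker, just before they are passed to the JIT, would help distinguish a truncated transfer from a file that was never valid object code.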

The underlying messaging layer could be the cause, but I'm not familiar enough with the stack to judge this.

This issue causes the worker to crash. Other workers that depend on the failed worker (that is, have slots scheduled on it) subsequently throw timeout errors like these:

faasm-dev-worker-3   | [13:17:32] [116] [E] Failed getting response on tcp://192.168.80.12:8006: code 2
faasm-dev-worker-3   | [13:17:32] [116] [E] Task 372121103 threw exception. What: Error on waiting for response.

mmathys avatar Sep 12 '22 12:09 mmathys

I believe the error is thrown here, in WAVMWasmModule.cpp.

The error does not occur with a single worker node (2 slots, 8 MPI processes per task, 4 tasks sent at the same time). I ran this configuration 10 times.
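
To gauge how strong that evidence is: if the single-node setup failed at the same ~30% rate as the 6-node cluster, and runs were independent (both are assumptions, not measurements), 10 clean runs in a row would be unlikely. A quick check of the arithmetic:

```python
def prob_all_clean(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all succeed."""
    return (1.0 - failure_rate) ** runs


# Assumed inputs: ~30% failure rate (from the 6-node setup) and 10 runs.
p = prob_all_clean(0.30, 10)
print(f"{p:.3f}")  # prints 0.028
```

Under those assumptions there is roughly a 3% chance of seeing no failure, so the single-node result is consistent with the error depending on multi-node execution, although it does not prove it.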

mmathys avatar Sep 15 '22 11:09 mmathys

Repro steps:

# Check out branch for repro on faasm fork
cd <path to faasm>
git remote add fork [email protected]:mmathys/faasm.git
git checkout issue673-repro

# in faasm CLI
inv dev.cmake
inv dev.cc pool_runner

# start faasm local cluster on host (not faasm CLI)
source bin/workon.sh
./bin/refresh_local.sh
inv cluster.start --workers 6
inv k8s.ini-file --local

# switch to faasm/experiments_base and check out the repro branch from the fork
cd <path to experiments-base>
git remote add fork [email protected]:mmathys/experiment-base.git
git checkout issue673-repro
# we're using a fork of faasm/experiments_mpi, update submodule
git submodule update --recursive

# Run LAMMPS experiment according to https://github.com/mmathys/granny-reproduce-results/blob/main/reproduce_results.md
cd experiments/mpi
source ../../bin/workon.sh
inv lammps.wasm.upload
inv lammps.data.upload --bench network --bench compute
# Run the modified task that triggers the race condition. If it finishes successfully, rerun it until the error occurs.
inv lammps.run.race

# In another terminal, observe logs. Filter out noisy sbrk logs (output of WASI malloc foreign calls)
cd <path to faasm>
docker compose logs worker -f | grep -v sbrk

# The cluster may freeze without throwing an error – in that case, restart it.
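
When repeating runs, it can help to summarize which workers hit the crash rather than reading the raw stream. The following is a minimal sketch, assuming the `<container> | <message>` line format seen in the logs above; the helper name `summarize` is hypothetical. It drops the noisy sbrk lines, as the `grep -v sbrk` filter does, and counts occurrences of the cantFail message per worker.

```python
import re
from collections import Counter

# Matches docker compose log lines such as "faasm-dev-worker-10  | some message".
LINE_RE = re.compile(r"^(?P<worker>\S+)\s+\|\s?(?P<msg>.*)$")
CRASH_MARKER = "Failure value returned from cantFail wrapped call"


def summarize(lines):
    """Return (kept_lines, crashes): non-sbrk log lines and a per-worker crash count."""
    kept = []
    crashes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        match = LINE_RE.match(line)
        # Skip lines that are not in the expected format, and the noisy sbrk lines.
        if match is None or "sbrk" in line:
            continue
        kept.append(line)
        if CRASH_MARKER in match["msg"]:
            crashes[match["worker"]] += 1
    return kept, crashes
```

For example, the function can be fed the lines of a log file saved from `docker compose logs worker` after a run has finished.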

mmathys avatar Sep 15 '22 11:09 mmathys

Closing, as we no longer see the error.

csegarragonz avatar Dec 05 '22 10:12 csegarragonz