simd-math
parallel_reduce with native simd type => seg fault in OpenMP
Hello,
I was trying to test a parallel_reduce (sum) using one of the native simd types and hit a seg fault that seems to be caused by incorrect memory alignment of the value returned by HostThreadTeamData::pool_reduce_local().
To illustrate this, I've updated avx.hpp to provide operator+= (used in the reduce join operation) and used the custom reducer provided below.
// custom reducer for simd type (here avx)
template <class T, class Space>
struct SimdReducer {
 public:
  using simd_t = simd::simd<float,simd::simd_abi::native>;
  //using simd_t = simd::simd<T,simd::simd_abi::pack<4>>;
  using simd_storage_t = simd_t::simd_storage_t;

  // Required
  using reducer = SimdReducer<T, Space>;
  using value_type = simd_t;
  using value_type_storage = simd_storage_t;
  using result_view_type = Kokkos::View<value_type, Space, Kokkos::MemoryUnmanaged>;

 private:
  result_view_type value;

 public:
  KOKKOS_INLINE_FUNCTION
  SimdReducer(value_type& value_) : value(&value_) {}

  // Required
  KOKKOS_INLINE_FUNCTION
  void join(value_type& dest, const value_type& src) const {
    dest += src;
  }

  KOKKOS_INLINE_FUNCTION
  void join(volatile value_type& dest, const volatile value_type& src) const {
    dest += src;
  }

  KOKKOS_INLINE_FUNCTION
  void init(value_type& val) const {
    printf("before init %p\n",&val);
    val = simd_t(0.0); // seg fault here
    printf("after init\n");
  }

  KOKKOS_INLINE_FUNCTION
  value_type& reference() const { return *value.data(); }

  KOKKOS_INLINE_FUNCTION
  result_view_type view() const { return value; }

  KOKKOS_INLINE_FUNCTION
  bool references_scalar() const { return true; }
};
- a parallel_reduce with this reducer works fine if the device is Serial, but gives me a segmentation fault when I use the OpenMP device (whatever the number of threads)
- If I change the simd type to e.g. simd_abi::pack<4>, the crash disappears and it works fine.
- here, when compiling for avx, simd<float,simd::simd_abi::native> is 32 bytes, but when I print, in the reducer's init, the address of the reference value coming from the call to pool_reduce_local() (in HostThreadTeamData), the address is 16-byte aligned, whereas I think it should be 32-byte aligned. I think this explains the seg fault.
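The alignment observation in the last bullet can be checked in isolation, without Kokkos or simd-math. The sketch below is only an illustration: Vec8f is a hypothetical stand-in for a 32-byte AVX value, and is_aligned / is_aligned_for are made-up helper names, not part of either library. It shows that an address which is only 16-byte aligned is not a valid location for such a type; compilers typically emit aligned 256-bit stores for it, and those fault on a misaligned address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 32-byte AVX simd value (8 floats).
struct alignas(32) Vec8f {
  float lanes[8];
};

// Returns true if p is a multiple of `alignment`.
// `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Returns true if p is suitably aligned to hold an object of type T.
template <class T>
inline bool is_aligned_for(const void* p) {
  return is_aligned(p, alignof(T));
}
```

Calling is_aligned_for<value_type>(&val) at the top of init would turn the crash into an explicit diagnostic.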
I may be wrong, but I think it is necessary to control alignment inside HostThreadTeamData so that the returned pointer is properly aligned.
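As a rough sketch of what "controlling alignment" could look like, the helper below uses std::align to find the first suitably aligned address inside a raw scratch buffer. This is not how HostThreadTeamData is implemented; first_aligned is a hypothetical name, and the buffer layout is assumed purely for illustration.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical helper, not part of Kokkos: returns the first address in
// [buffer, buffer + bytes) that is aligned to `alignment` and still leaves
// room for `size` bytes, or nullptr if the buffer is too small.
inline void* first_aligned(void* buffer, std::size_t bytes,
                           std::size_t alignment, std::size_t size) {
  void* p = buffer;
  std::size_t space = bytes;
  // std::align adjusts p and space in place and returns the aligned pointer,
  // or nullptr when the request does not fit.
  return std::align(alignment, size, p, space);
}
```

With a 32-byte value type, a scratch region that starts at a 16-byte aligned address would need up to 16 bytes of padding, so the pool would have to reserve that extra space.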