Load index into Vector using highway ops
Hi Champs!
I am trying to load/store an index into a simd vector from an index array pointer of type(size_t* indices). Is there any relevant operator available in highway to perform such load and store operations?
Thanks
Hi, our lane types are fixed-size whereas size_t can be either 32-bit or 64-bit.
I'd recommend choosing an IndexT as either uint32_t #if SIZE_MAX == 0xFFFFFFFF else uint64_t. Then you can cast pointers to this type and use the normal Load/LoadU.
Hope this helps?
@jan-wassenberg Thanks for the response! It sounds good. I’m currently trying to load elements from an array with any data type using an index vector. I believe this can be achieved using the GatherIndex operation in Highway, but I’m facing some confusion regarding the following:
Let’s say I want to load int32 elements into a 128-bit SIMD vector using an index vector.
The index vector is 64 bits wide, meaning it can hold 2 indices per vector. However, int32 elements can be processed in sets of 4 at a time.
This leads me to wonder—using the GatherIndex operation to load 4 int32 elements into a vector with only 2 indices per vector seems impossible, right?
How can this situation be handled?
Yes, mixed-width gathers are problematic. Although x86 supports them, other platforms do not. Thus I would advise ensuring the width of the index and the to-be-gathered elements match. If you really must have size_t indices, then you'd probably want to DemoteTo them to u32 (unless SIZE_MAX == 0xFFFFFFFF) to match your int32 data.
@jan-wassenberg I understood, but array size always would not be of size less than int32 right? such cases we can't able to use gather right for all dtypes with index vector of either int32/int64 type?
This what I am trying to achieve: const size_t N = Lanes(d); const size_t* read = indices; const size_t iL0 = readL[0 * N]; alignas(alignof(T)) size_t scalar_indices[N]; StoreU(indices_vec, d, scalar_indices); Vec< D > arr = GatherIndex(d, arr, scalar_indices);
That's correct. CPUs usually do not support gather for less than 32-bit elements. Gather is anyway slow and best avoided if possible, unless you're going to use the resulting vector inside other SIMD code.
@jan-wassenberg I am currently trying to use basecase of vqsort to work with sort indices along with keys wihtout the use of KV structure, thats were I am trying to use gather and index vector here. Do you have any suggestions
hm, I'm not sure I understand. VQSort key+value only works with KV structures, unless you are defining u32 "keys" yourself which consist of the actual u16 key plus a u16 index. If you want to gather values, you can just use normal C++ code. This will be about as fast as vector code for that.
hm, I'm not sure I understand. VQSort key+value only works with KV structures, unless you are defining u32 "keys" yourself which consist of the actual u16 key plus a u16 index. If you want to gather values, you can just use normal C++ code. This will be about as fast as vector code for that.
@jan-wassenberg to be precise, I am trying to introduce two different pointers one is Key and another index pointer, trying modify basecase code to implement index swapping along with key swap.
Ah, got it. That will involve changing traits-inl.h
HWY_INLINE Vec<D> First(D /* tag */, const Vec<D> a, const Vec<D> b) const {
return Min(a, b);
}
to add an output argument for the indices, and instead of returning the min directly, getting a mask via Lt(a, b), then using that to return IfThenElse(mask, a, b) (which is the same as Min), but then also doing the IfThenElse on two also to be added index arguments. Again, a sizeable undertaking.
@jan-wassenberg that's sounds good!, I believe for basecase the size of elements would be less than 256, so we can handle the indices in uint16, the only thing I am still getting stuck is that how we can handle appropriate indices to dtype of input key.
For the code above, indices would be the same size as the dtype. You can use MakeUnsigned<T> to obtain an unsigned type of the same size as T.
@jan-wassenberg I could able to understand the VQSort calls for all DTypes, but how KV is actually handled, since keys are used for comparison, how values are getting rearranged accordingly to the keys?
For the code above, indices would be the same size as the dtype. You can use
MakeUnsigned<T>to obtain an unsigned type of the same size as T.
@jan-wassenberg Between I tried implementing the suggestions you gave; it got worked but while copying the sorted vectors back to indices array, it stores wrong results, not sure what goes wrong
KV is handled by including the value inside the key, and changing the key comparator to ignore the value part of the key.
@jan-wassenberg Kindly can you share the exact implementation details that performs the above case
@jan-wassenberg are you pointing to this piece of code :
struct OrderAscendingKV64 : public KeyValue64 { using Order = SortAscending; using OrderForSortingNetwork = OrderAscending<LaneType>;
HWY_INLINE bool Compare1(const LaneType* a, const LaneType* b) const { return (*a >> 32) < (*b >> 32); }
template <class D> HWY_INLINE Mask<D> Compare(D /* tag */, Vec<D> a, Vec<D> b) const { return Lt(ShiftRight<32>(a), ShiftRight<32>(b)); }
// Not required to be stable (preserving the order of equivalent keys), so // we can include the value in the comparison. template <class D> HWY_INLINE Vec<D> First(D /* tag */, const Vec<D> a, const Vec<D> b) const { return Min(a, b); }
template <class D> HWY_INLINE Vec<D> Last(D /* tag */, const Vec<D> a, const Vec<D> b) const { return Max(a, b); }
template <class D> HWY_INLINE Vec<D> FirstOfLanes(D d, Vec<D> v, uint64_t* HWY_RESTRICT /* buf */) const { return MinOfLanes(d, v); }
template <class D> HWY_INLINE Vec<D> LastOfLanes(D d, Vec<D> v, uint64_t* HWY_RESTRICT /* buf */) const { return MaxOfLanes(d, v); }
// Same as for regular lanes. template <class D> HWY_INLINE Vec<D> FirstValue(D d) const { return Set(d, hwy::LowestValue<TFromD<D>>()); }
template <class D> HWY_INLINE Vec<D> LastValue(D d) const { return Set(d, hwy::HighestValue<TFromD<D>>()); }
template <class D> HWY_INLINE Vec<D> PrevValue(D d, Vec<D> v) const { return Sub(v, Set(d, uint64_t{1} << 32)); } };
Yes, that's right. Compare is shifting out the value bits.
@jan-wassenberg, does the current VQSort implementation in [vqsort-ini.h] support all Highway-supported intrinsics, including x86 (AVX-512, AVX2), Arm (NEON, SVE), and RISC-V (V extension)? Are there any other intrinsic sets supported by VQSort?
Thanks in advance!
Yes indeed, you can see the full list in the readme or detect_targets.
Closing, feel free to reopen if you'd like to continue.