highway icon indicating copy to clipboard operation
highway copied to clipboard

Support for highway ArgSort

Open Greenie0701 opened this issue 11 months ago • 25 comments

Hi Developers!

I could see highway repository supports variety of hardware to enable optimised sort algorithms. As one of such sort algorithms ArgSort are primarily used for many scientific libraries. Is there any plan to add support for ArgSort implementation from Highway? So that it can be used for popular Scientific libraries such as NumPy which can improve performance of ArgSort on ARM devices.

Thanks!

Greenie0701 avatar Feb 19 '25 09:02 Greenie0701

Hi, it is possible to get argsort reasonably efficiently using the existing VQSort calls. If you arrange the input so that 32 or 64-bit keys are in K32V32 or K64V64 structs, and place iota (0..N-1) into the value part of that struct, then the resulting values will be argsort. Does that fit your use case?

jan-wassenberg avatar Feb 20 '25 08:02 jan-wassenberg

Hi @jan-wassenberg thanks for your valuable information! I pretty new to highway source, kindly can you explain more precisely like how we can implement this probably with some example, like we can implement and call argsort function via VQSort.

Thanks in advance

Greenie0701 avatar Feb 20 '25 09:02 Greenie0701

@jan-wassenberg are you referring to this part of code vqsort_kv64a.cc

Greenie0701 avatar Feb 20 '25 09:02 Greenie0701

Yes, that's right. A somewhat more detailed sketch:

K32V32 kv[];
for (size_t i = 0; i < keys.size(); ++i) {
kv[i].key = keys[i];
kv[i].value = i;
}
VQSort(kv, keys.size(), SortAscending());
// output kv, or copy kv[i].value into output array

jan-wassenberg avatar Feb 20 '25 10:02 jan-wassenberg

@zafeerali943 wonderful! thanks for your valuable information! just one more clarification what are dtypes does this KV sort offers? Could it be applicable for int32 and int64 ? Also what could be the necessary source to build VQsort for this scenario

Greenie0701 avatar Feb 20 '25 10:02 Greenie0701

Yes, that's right. A somewhat more detailed sketch:

K32V32 kv[];
for (size_t i = 0; i < keys.size(); ++i) {
kv[i].key = keys[i];
kv[i].value = i;
}
VQSort(kv, keys.size(), SortAscending());
// output kv, or copy kv[i].value into output array

Just now I tried compiling sample source using clang-cl:

`#include "hwy/highway.h"

#include "hwy/contrib/sort/vqsort-inl.h"

int main(){ hwy::K32V32 kv[20]; int keys[20] = {184, 100, 61, 207, 58, 141, 5, 150, 115, 33, 72, 47, 195, 130, 213, 177, 225, 165, 72, 89};

for (size_t i = 0; i < 20; ++i) {
    kv[i].key = keys[i];
    kv[i].value = i;
}
hwy::VQSort(kv,20,hwy::SortAscending());
return 0;

} `

Error: lld-link: error: undefined symbol: void __cdecl hwy::VQSort(struct hwy::K32V32 *__restrict, unsigned __int64, struct hwy::SortAscending)

referenced by C:\Users\HCKTest\AppData\Local\Temp\kvsort-e27948.obj:(main)

Greenie0701 avatar Feb 20 '25 11:02 Greenie0701

Yes, for int32 you can use K32V32 and int64 K64V64. For calling VQSort, the correct and sufficient header is vqsort.h, but you will have to include in your project vqsort*.cc.

jan-wassenberg avatar Feb 20 '25 12:02 jan-wassenberg

I am trying to compile using LLVM's clang-cl on windows, can you give a sample source for the above use case to compile using clang-cl, I am facing linking errors

Greenie0701 avatar Feb 20 '25 13:02 Greenie0701

The above is as close to full source code as we are willing and able to provide.

jan-wassenberg avatar Feb 20 '25 14:02 jan-wassenberg

Hi @jan-wassenberg, thanks for your patience and support!

I tried including all vqsort sources in cmakelist and tried compiling my sample source, but ended up with someother undefined errors.

My Program:

#include "hwy/contrib/sort/vqsort.h"
int main(){
    hwy::K32V32 kv[20];
    int keys[20] = {184, 100, 61, 207, 58, 141, 5, 150, 115, 33, 72, 47, 195, 130, 213, 177, 225, 165, 72, 89};
    for (size_t i = 0; i < 20; ++i) {
        kv[i].key = keys[i];
        kv[i].value = i;
    }
    hwy::VQSort(kv,20,hwy::SortAscending()); 
    return 0;
};

My Cmakelist:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.10)

# Set Clang-CL as the compiler
set(CMAKE_C_COMPILER "clang-cl")
set(CMAKE_CXX_COMPILER "clang-cl")

project(vqsort_example)

# Specify C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Find all vqsort source files
file(GLOB VQSORT_SOURCES 
    "highway/hwy/contrib/sort/vqsort*.cc"
)

# Add source files
set(SOURCES
    samplekv.cpp
    ${VQSORT_SOURCES}
)

# Create executable
add_executable(sample_sort ${SOURCES})

# Add include directory for Highway
target_include_directories(sample_sort PRIVATE
    C:/Users/HCKTest/Desktop/highway
)


target_compile_options(sample_sort PRIVATE 
    /EHsc    
    -fuse-ld=lld  
)
```
Find the error in my attached logs:

[logs.txt](https://github.com/user-attachments/files/18904902/logs.txt)

Greenie0701 avatar Feb 21 '25 10:02 Greenie0701

@zafeerali943 the issue can be fixed by adding abort.cc file from hwy folder

MugundanMCW avatar Feb 21 '25 11:02 MugundanMCW

I was able to reproduce this issue, compiling with necessary cc source file included in the project would fix the above issue

MugundanMCW avatar Feb 21 '25 11:02 MugundanMCW

Yes, for int32 you can use K32V32 and int64 K64V64. For calling VQSort, the correct and sufficient header is vqsort.h, but you will have to include in your project vqsort*.cc.

@jan-wassenberg I just went through source code of KV sort sources, currently it supports only uint32\uint64 right? do you have any plan to support int32/int64 in future release?

MugundanMCW avatar Feb 21 '25 11:02 MugundanMCW

Hi @Mugundanmcw , that's right. Because we are likely anyway copying from the user's input to the KV type, it is probably best to then also apply a transformation that makes int32 sort in the desired order, when in fact the underlying sort is a u32. (Same also applies for int64 and u64.) This transformation extracts the sign bit, broadcasts it to all bits, then XORs the signed input with that. For example, uint32_t flip = (input < 0) ? ~0u : 0; transformed = static_cast<uint32_t>(input) ^ flip.

jan-wassenberg avatar Feb 21 '25 12:02 jan-wassenberg

Hi @Mugundanmcw , that's right. Because we are likely anyway copying from the user's input to the KV type, it is probably best to then also apply a transformation that makes int32 sort in the desired order, when in fact the underlying sort is a u32. (Same also applies for int64 and u64.) This transformation extracts the sign bit, broadcasts it to all bits, then XORs the signed input with that. For example, uint32_t flip = (input < 0) ? ~0u : 0; transformed = static_cast<uint32_t>(input) ^ flip.

@jan-wassenberg which means applying appropriate transformation for int32/int64 inputs, we can use the existing KV sort for int32/int64 too ?

Greenie0701 avatar Feb 24 '25 03:02 Greenie0701

@jan-wassenberg do you have any plans to support float32/64 for KV sort, as this could be used for NumPy as well where it currently lagging optimised support for ArgSort?

Thanks

MugundanMCW avatar Feb 24 '25 08:02 MugundanMCW

A similar transform is possible for float and double, such they could also be used with the existing code. See for example https://fgiesen.wordpress.com/2013/01/21/order-preserving-bijections/.

jan-wassenberg avatar Feb 24 '25 10:02 jan-wassenberg

@jan-wassenberg probably I will check with standalone CPP code with these transformations with KV and get back on this, between One doubt, wont these transformations affects overall execution and affect performance?

MugundanMCW avatar Feb 24 '25 11:02 MugundanMCW

Assuming we are anyway copying from an array of just keys into the KV struct, the transform is pretty much free especially if vectorized.

jan-wassenberg avatar Feb 24 '25 12:02 jan-wassenberg

@jan-wassenberg Does highway VQSort can be compiled with SVE support on WoA with clang compiler?

Greenie0701 avatar Mar 04 '25 10:03 Greenie0701

Yes, VQSort works with SVE. What do you mean by WoA? If it's Windows on Arm, I am not familiar but it ought to work.

jan-wassenberg avatar Mar 04 '25 12:03 jan-wassenberg

@jan-wassenberg yes whether it can be compiled on Windows on ARM that supports SVE with clang?

Greenie0701 avatar Mar 04 '25 13:03 Greenie0701

Yes, I would think so :)

jan-wassenberg avatar Mar 04 '25 14:03 jan-wassenberg

Yes, I would think so :)

@jan-wassenberg I have raised a PR in the NumPy repository to optimize Argsort using highway's KV VQSort for uint32 and uint64 dtypes. But unfortunately the patch received minimal response due to its memory constraints(for copying data to KV structure) and limitation for supporting other dtypes. Is there any possibility to add native support for argsort from VQSort which could benefit all dtypes and improve the performance significantly? This could be very useful for devices that lack support for optimised argsort in NumPy.

MugundanMCW avatar Mar 06 '25 02:03 MugundanMCW

Supporting other dtypes can certainly be done, as mentioned. But the extra memory is indeed an issue. It's unclear to me whether it's a bigger issue than not using an optimized sort at all. Allocating is not cheap, but I'd still expect there to be a net savings for large-enough arrays.

Implementing native sort is a huge change that I personally am not going to be able to do in the foreseeable future.

jan-wassenberg avatar Mar 06 '25 06:03 jan-wassenberg