SimSIMD

Initial WASM support.

Open Sero1000 opened this issue 1 year ago • 3 comments

I finally had some free time and started working on WASM SIMD support. Emscripten translates NEON intrinsics to WASM SIMD intrinsics. Not all operations are ported yet, but it's a good first step, I guess.
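To make the NEON-to-WASM translation concrete, here is a minimal sketch of a NEON dot product that could be reused for a WASM build. It is not SimSIMD's actual kernel: `sketch_dot_f32` and `SKETCH_HAS_NEON` are hypothetical names, and the sketch assumes the toolchain defines `__ARM_NEON` or `__ARM_NEON__` when NEON intrinsics are available (as far as I know Emscripten does so when NEON support is requested, but treat that as an assumption). On other targets it falls back to a scalar loop.

```c
#include <stddef.h>

/* Assumption: the compiler defines one of these macros when <arm_neon.h> is usable. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#else
#define SKETCH_HAS_NEON 0
#endif

/* Hypothetical helper, not part of SimSIMD: dot product of two f32 vectors. */
static float sketch_dot_f32(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if SKETCH_HAS_NEON
    /* Accumulate four lanes at a time with multiply-add. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    /* Horizontal sum via lane extraction, avoiding AArch64-only intrinsics. */
    sum = vgetq_lane_f32(acc, 0) + vgetq_lane_f32(acc, 1) +
          vgetq_lane_f32(acc, 2) + vgetq_lane_f32(acc, 3);
#endif
    /* Scalar tail; also the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

With Emscripten, a build along the lines of `emcc -O3 -msimd128 -mfpu=neon` should take the NEON branch and lower it to WASM SIMD; a native x86 build of the same file takes the scalar branch.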

Sero1000 avatar Dec 17 '24 20:12 Sero1000

Thank you, @Sero1000! Any chance you have performance benchmarks comparing WASM performance to native code? Is there a programmatic API to check if NEON is enabled at runtime?

ashvardanian avatar Dec 17 '24 20:12 ashvardanian

I don't think there is a way to check whether NEON is enabled at runtime; at least I haven't seen one in the documentation. As for the benchmarks, I am looking into bench.cxx. I just wanted to open a PR to get some feedback and start a discussion, since I have touched part of the interface.
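In the absence of a runtime query, one option is to report the path that was selected at compile time. The following is only a sketch under that assumption: `sketch_compiled_simd_target` is a hypothetical helper, not a SimSIMD API, and it relies on the predefined macros `__wasm_simd128__` (set by Clang when `-msimd128` is passed), `__ARM_NEON` / `__ARM_NEON__`, and `__AVX2__`.

```c
/* Hypothetical helper, not part of SimSIMD: names the SIMD target this
   translation unit was compiled for. This is a compile-time decision,
   not a runtime capability check. */
static const char *sketch_compiled_simd_target(void) {
#if defined(__wasm_simd128__)
    return "wasm_simd128";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "neon";
#elif defined(__AVX2__)
    return "avx2";
#else
    return "serial";
#endif
}
```

A caller could print the returned string at startup. Note that a WASM module built with SIMD enabled will typically fail to load on an engine without SIMD support, so a compile-time answer may be the best available here.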

Sero1000 avatar Dec 18 '24 20:12 Sero1000

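I ran some benchmarks. In the WASM build, the SIMD kernels are faster than the serial ones for every method except hamming_b8 and jaccard_b8, where the NEON-translated versions are roughly 11x slower (8672 ns vs 805 ns and 17235 ns vs 1584 ns).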

-------------------------------------------------------------------------------------------------------------
Benchmark WASM                                            Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
dot_f16_neon<1536d>/min_time:10.000/threads:1            3757 ns         3757 ns      3646765 abs_delta=6.36803n bytes=67.9446M/s pairs=266.174k/s relative_error=733.293n
dot_f32_neon<1536d>/min_time:10.000/threads:1             297 ns          297 ns     47534715 abs_delta=7.05499n bytes=303.964M/s pairs=3.37118M/s relative_error=1.49816u
dot_f16c_neon<1536d>/min_time:10.000/threads:1           8669 ns         8669 ns      1620263 abs_delta=6.87199n bytes=194.345M/s pairs=115.347k/s relative_error=913.666n
dot_f32c_neon<1536d>/min_time:10.000/threads:1            597 ns          597 ns     23325223 abs_delta=7.02965n bytes=144.441M/s pairs=1.67616M/s relative_error=1.15408u
cos_f16_neon<1536d>/min_time:10.000/threads:1            4100 ns         4100 ns      3423223 abs_delta=21.6182n bytes=274.464M/s pairs=243.887k/s relative_error=21.6787n
cos_f32_neon<1536d>/min_time:10.000/threads:1             322 ns          322 ns     43503778 abs_delta=7.43692n bytes=142.509M/s pairs=3.1022M/s relative_error=7.51849n
l2sq_f16_neon<1536d>/min_time:10.000/threads:1           3750 ns         3750 ns      3649013 abs_delta=382.132n bytes=69.0376M/s pairs=266.666k/s relative_error=193.164n
l2sq_f32_neon<1536d>/min_time:10.000/threads:1            296 ns          296 ns     47203587 abs_delta=213.066n bytes=15.5219M/s pairs=3.37502M/s relative_error=107.258n
hamming_b8_neon<1536d>/min_time:10.000/threads:1         8672 ns         8672 ns      1621155 abs_delta=0 bytes=48.7387M/s pairs=115.31k/s relative_error=0
jaccard_b8_neon<1536d>/min_time:10.000/threads:1        17235 ns        17235 ns       811441 abs_delta=0 bytes=178.242M/s pairs=58.0215k/s relative_error=0
kl_f32_neon<1536d>/min_time:10.000/threads:1             1800 ns         1800 ns      7801078 abs_delta=nan bytes=97.5784M/s pairs=555.484k/s relative_error=nan
js_f32_neon<1536d>/min_time:10.000/threads:1             2972 ns         2972 ns      4716064 abs_delta=nan bytes=150.995M/s pairs=336.465k/s relative_error=nan
dot_f16_serial<1536d>/min_time:10.000/threads:1          8687 ns         8687 ns      1609608 abs_delta=13.1164n bytes=92.9335M/s pairs=115.111k/s relative_error=1.91143u
dot_f32_serial<1536d>/min_time:10.000/threads:1          1101 ns         1101 ns     12749638 abs_delta=13.9628n bytes=145.953M/s pairs=908.294k/s relative_error=2.21015u
dot_f16c_serial<1536d>/min_time:10.000/threads:1        14876 ns        14876 ns       950814 abs_delta=9.16103n bytes=218.72M/s pairs=67.2219k/s relative_error=1045.37n
dot_f32c_serial<1536d>/min_time:10.000/threads:1         1517 ns         1517 ns      9269163 abs_delta=7.53501n bytes=11.785M/s pairs=659.312k/s relative_error=1034.78n
cos_f16_serial<1536d>/min_time:10.000/threads:1         11239 ns        11239 ns      1263726 abs_delta=28.8175n bytes=244.266M/s pairs=88.9747k/s relative_error=29.0959n
cos_f32_serial<1536d>/min_time:10.000/threads:1          1141 ns         1141 ns     12289692 abs_delta=24.6526n bytes=49.3568M/s pairs=876.712k/s relative_error=24.9056n
l2sq_f16_serial<1536d>/min_time:10.000/threads:1        10551 ns        10551 ns      1316720 abs_delta=1.25749u bytes=273.153M/s pairs=94.7746k/s relative_error=633.407n
l2sq_f32_serial<1536d>/min_time:10.000/threads:1         1120 ns         1120 ns     12474250 abs_delta=873.252n bytes=211.832M/s pairs=892.801k/s relative_error=439.914n
hamming_b8_serial<1536d>/min_time:10.000/threads:1        805 ns          805 ns     17305283 abs_delta=0 bytes=116.465M/s pairs=1.24241M/s relative_error=0
jaccard_b8_serial<1536d>/min_time:10.000/threads:1       1584 ns         1584 ns      8825746 abs_delta=0 bytes=96.074M/s pairs=631.419k/s relative_error=0
kl_f32_serial<1536d>/min_time:10.000/threads:1          21264 ns        21264 ns       657289 abs_delta=nan bytes=270.58M/s pairs=47.0277k/s relative_error=nan
js_f32_serial<1536d>/min_time:10.000/threads:1          33607 ns        33607 ns       417679 abs_delta=nan bytes=59.662M/s pairs=29.7557k/s relative_error=nan

--------------------------------------------------------------------------------------------------------------
Benchmark Native(Haswell)                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
dot_f16_haswell<1536d>/min_time:10.000/threads:1           239 ns          239 ns     57788068 abs_delta=4.49447n bytes=25.7194G/s pairs=4.18609M/s relative_error=838.368n
dot_f32_haswell<1536d>/min_time:10.000/threads:1           229 ns          229 ns     61695066 abs_delta=4.05283n bytes=53.7205G/s pairs=4.37178M/s relative_error=618.113n
dot_f16c_haswell<1536d>/min_time:10.000/threads:1          485 ns          485 ns     29233506 abs_delta=5.26774n bytes=25.3399G/s pairs=2.06217M/s relative_error=1053.19n
dot_f32c_haswell<1536d>/min_time:10.000/threads:1          479 ns          479 ns     29155417 abs_delta=5.35522n bytes=51.3474G/s pairs=2.08933M/s relative_error=914.997n
cos_f16_haswell<1536d>/min_time:10.000/threads:1           248 ns          248 ns     56688499 abs_delta=21.2755n bytes=24.7348G/s pairs=4.02585M/s relative_error=21.336n
cos_f32_haswell<1536d>/min_time:10.000/threads:1           245 ns          245 ns     57090484 abs_delta=4.06642n bytes=50.2399G/s pairs=4.08853M/s relative_error=4.10406n
l2sq_f16_haswell<1536d>/min_time:10.000/threads:1          243 ns          243 ns     58014704 abs_delta=306.799n bytes=25.3349G/s pairs=4.12353M/s relative_error=154.647n
l2sq_f32_haswell<1536d>/min_time:10.000/threads:1          232 ns          232 ns     60353965 abs_delta=110.947n bytes=53.0122G/s pairs=4.31415M/s relative_error=56.0062n
hamming_b8_haswell<1536d>/min_time:10.000/threads:1        103 ns          103 ns    137056450 abs_delta=0 bytes=29.7774G/s pairs=9.69315M/s relative_error=0
jaccard_b8_haswell<1536d>/min_time:10.000/threads:1        141 ns          141 ns     99841876 abs_delta=0 bytes=21.8257G/s pairs=7.10474M/s relative_error=0
dot_f16_serial<1536d>/min_time:10.000/threads:1           6924 ns         6922 ns      2023513 abs_delta=12.4463n bytes=887.546M/s pairs=144.457k/s relative_error=1.70755u
dot_f32_serial<1536d>/min_time:10.000/threads:1           1063 ns         1063 ns     13210677 abs_delta=14.2338n bytes=11.5633G/s pairs=941.021k/s relative_error=2.37246u
dot_f16c_serial<1536d>/min_time:10.000/threads:1         14483 ns        14480 ns       972223 abs_delta=9.16103n bytes=848.606M/s pairs=69.0598k/s relative_error=1045.37n
dot_f32c_serial<1536d>/min_time:10.000/threads:1          2303 ns         2303 ns      6104887 abs_delta=6.96473n bytes=10.6735G/s pairs=434.304k/s relative_error=942.332n
cos_f16_serial<1536d>/min_time:10.000/threads:1           7174 ns         7174 ns      1940213 abs_delta=30.3359n bytes=856.47M/s pairs=139.399k/s relative_error=30.6709n
cos_f32_serial<1536d>/min_time:10.000/threads:1           1110 ns         1110 ns     12660383 abs_delta=25.3824n bytes=11.0713G/s pairs=900.988k/s relative_error=25.6467n
l2sq_f16_serial<1536d>/min_time:10.000/threads:1          7130 ns         7128 ns      1969190 abs_delta=1.23103u bytes=861.91M/s pairs=140.285k/s relative_error=620.233n
l2sq_f32_serial<1536d>/min_time:10.000/threads:1          1068 ns         1068 ns     13112453 abs_delta=876.925n bytes=11.5097G/s pairs=936.663k/s relative_error=441.684n
hamming_b8_serial<1536d>/min_time:10.000/threads:1         733 ns          733 ns     18818066 abs_delta=0 bytes=4.18988G/s pairs=1.36389M/s relative_error=0
jaccard_b8_serial<1536d>/min_time:10.000/threads:1        1175 ns         1175 ns     11951942 abs_delta=0 bytes=2.61374G/s pairs=850.828k/s relative_error=0
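
For reading the WASM table, the speedup of each SIMD kernel over its serial counterpart is just the ratio of the two timings. A small sketch of that arithmetic follows, using a handful of timings copied from the WASM table above; the type and function names (`sketch_timing_t`, `sketch_speedup`, `sketch_print_speedups`) are made up for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical record: one kernel with its SIMD and serial timings in ns. */
typedef struct {
    const char *kernel;
    double simd_ns;
    double serial_ns;
} sketch_timing_t;

/* Timings copied from the WASM benchmark table (1536 dimensions). */
static const sketch_timing_t sketch_wasm_timings[] = {
    {"dot_f32", 297.0, 1101.0},
    {"cos_f32", 322.0, 1141.0},
    {"l2sq_f32", 296.0, 1120.0},
    {"hamming_b8", 8672.0, 805.0},
    {"jaccard_b8", 17235.0, 1584.0},
};

/* Speedup > 1 means the SIMD kernel is faster than the serial one. */
static double sketch_speedup(sketch_timing_t t) {
    return t.serial_ns / t.simd_ns;
}

/* Prints one line per kernel with its serial-to-SIMD ratio. */
static void sketch_print_speedups(void) {
    size_t count = sizeof(sketch_wasm_timings) / sizeof(sketch_wasm_timings[0]);
    for (size_t i = 0; i < count; ++i)
        printf("%-12s %.2fx\n", sketch_wasm_timings[i].kernel,
               sketch_speedup(sketch_wasm_timings[i]));
}
```

The f32 kernels come out at a bit over 3.5x, while the two binary kernels land below 0.1x, which is the regression mentioned above.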

Sero1000 avatar Dec 29 '24 19:12 Sero1000