Initial WASM support.
I finally had some free time and started working on WASM SIMD support. Emscripten translates NEON intrinsics to WASM SIMD intrinsics. Not all operations are ported yet, but it's a good initial step, I guess.
Thank you, @Sero1000! Any chance you have performance benchmarks comparing WASM performance to native code? Is there a programmatic API to check if NEON is enabled at runtime?
I don't think there is a way to check whether NEON is enabled at runtime; at least I haven't seen one in the documentation. As for the benchmarks, I am looking into bench.cxx. I just wanted to open a PR to get some feedback and discussion started, since I have touched part of the interface.
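Since a runtime check does not seem to be available, one option is to decide at compile time. Below is a minimal sketch, not code from this PR: it relies on the `__wasm_simd128__` macro that clang defines when building with `-msimd128`, and falls back to a scalar loop otherwise. The names `DEMO_HAS_WASM_SIMD`, `demo_has_wasm_simd` and `demo_dot_f32` are hypothetical and only for illustration.

```c
#include <stddef.h>

/* clang (and therefore Emscripten) defines __wasm_simd128__ when the
 * translation unit is compiled with -msimd128. */
#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#define DEMO_HAS_WASM_SIMD 1
#else
#define DEMO_HAS_WASM_SIMD 0
#endif

/* Hypothetical helper: reports what was decided at compile time.
 * It is not a runtime probe of the host engine. */
static int demo_has_wasm_simd(void) { return DEMO_HAS_WASM_SIMD; }

/* Hypothetical dot product: 4 lanes at a time when WASM SIMD is
 * compiled in, scalar loop for the tail or when it is not. */
static float demo_dot_f32(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if DEMO_HAS_WASM_SIMD
    v128_t acc = wasm_f32x4_splat(0.0f);
    for (; i + 4 <= n; i += 4) {
        v128_t va = wasm_v128_load(a + i);
        v128_t vb = wasm_v128_load(b + i);
        acc = wasm_f32x4_add(acc, wasm_f32x4_mul(va, vb));
    }
    /* Horizontal sum of the four accumulator lanes. */
    sum = wasm_f32x4_extract_lane(acc, 0) + wasm_f32x4_extract_lane(acc, 1) +
          wasm_f32x4_extract_lane(acc, 2) + wasm_f32x4_extract_lane(acc, 3);
#endif
    /* Scalar tail (or the whole vector when SIMD is not compiled in). */
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

As far as I know, a module that contains SIMD instructions fails validation on an engine without SIMD support, so the actual feature detection would have to happen on the host side before the module is loaded, for example by shipping both a SIMD and a non-SIMD build.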
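I ran some benchmarks. In the WASM build, the SIMD (NEON-translated) kernels are faster than the serial ones for every method except hamming_b8 and jaccard_b8, where they are roughly 10x slower (8672 ns vs 805 ns and 17235 ns vs 1584 ns).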
-------------------------------------------------------------------------------------------------------------
Benchmark WASM Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
dot_f16_neon<1536d>/min_time:10.000/threads:1 3757 ns 3757 ns 3646765 abs_delta=6.36803n bytes=67.9446M/s pairs=266.174k/s relative_error=733.293n
dot_f32_neon<1536d>/min_time:10.000/threads:1 297 ns 297 ns 47534715 abs_delta=7.05499n bytes=303.964M/s pairs=3.37118M/s relative_error=1.49816u
dot_f16c_neon<1536d>/min_time:10.000/threads:1 8669 ns 8669 ns 1620263 abs_delta=6.87199n bytes=194.345M/s pairs=115.347k/s relative_error=913.666n
dot_f32c_neon<1536d>/min_time:10.000/threads:1 597 ns 597 ns 23325223 abs_delta=7.02965n bytes=144.441M/s pairs=1.67616M/s relative_error=1.15408u
cos_f16_neon<1536d>/min_time:10.000/threads:1 4100 ns 4100 ns 3423223 abs_delta=21.6182n bytes=274.464M/s pairs=243.887k/s relative_error=21.6787n
cos_f32_neon<1536d>/min_time:10.000/threads:1 322 ns 322 ns 43503778 abs_delta=7.43692n bytes=142.509M/s pairs=3.1022M/s relative_error=7.51849n
l2sq_f16_neon<1536d>/min_time:10.000/threads:1 3750 ns 3750 ns 3649013 abs_delta=382.132n bytes=69.0376M/s pairs=266.666k/s relative_error=193.164n
l2sq_f32_neon<1536d>/min_time:10.000/threads:1 296 ns 296 ns 47203587 abs_delta=213.066n bytes=15.5219M/s pairs=3.37502M/s relative_error=107.258n
hamming_b8_neon<1536d>/min_time:10.000/threads:1 8672 ns 8672 ns 1621155 abs_delta=0 bytes=48.7387M/s pairs=115.31k/s relative_error=0
jaccard_b8_neon<1536d>/min_time:10.000/threads:1 17235 ns 17235 ns 811441 abs_delta=0 bytes=178.242M/s pairs=58.0215k/s relative_error=0
kl_f32_neon<1536d>/min_time:10.000/threads:1 1800 ns 1800 ns 7801078 abs_delta=nan bytes=97.5784M/s pairs=555.484k/s relative_error=nan
js_f32_neon<1536d>/min_time:10.000/threads:1 2972 ns 2972 ns 4716064 abs_delta=nan bytes=150.995M/s pairs=336.465k/s relative_error=nan
dot_f16_serial<1536d>/min_time:10.000/threads:1 8687 ns 8687 ns 1609608 abs_delta=13.1164n bytes=92.9335M/s pairs=115.111k/s relative_error=1.91143u
dot_f32_serial<1536d>/min_time:10.000/threads:1 1101 ns 1101 ns 12749638 abs_delta=13.9628n bytes=145.953M/s pairs=908.294k/s relative_error=2.21015u
dot_f16c_serial<1536d>/min_time:10.000/threads:1 14876 ns 14876 ns 950814 abs_delta=9.16103n bytes=218.72M/s pairs=67.2219k/s relative_error=1045.37n
dot_f32c_serial<1536d>/min_time:10.000/threads:1 1517 ns 1517 ns 9269163 abs_delta=7.53501n bytes=11.785M/s pairs=659.312k/s relative_error=1034.78n
cos_f16_serial<1536d>/min_time:10.000/threads:1 11239 ns 11239 ns 1263726 abs_delta=28.8175n bytes=244.266M/s pairs=88.9747k/s relative_error=29.0959n
cos_f32_serial<1536d>/min_time:10.000/threads:1 1141 ns 1141 ns 12289692 abs_delta=24.6526n bytes=49.3568M/s pairs=876.712k/s relative_error=24.9056n
l2sq_f16_serial<1536d>/min_time:10.000/threads:1 10551 ns 10551 ns 1316720 abs_delta=1.25749u bytes=273.153M/s pairs=94.7746k/s relative_error=633.407n
l2sq_f32_serial<1536d>/min_time:10.000/threads:1 1120 ns 1120 ns 12474250 abs_delta=873.252n bytes=211.832M/s pairs=892.801k/s relative_error=439.914n
hamming_b8_serial<1536d>/min_time:10.000/threads:1 805 ns 805 ns 17305283 abs_delta=0 bytes=116.465M/s pairs=1.24241M/s relative_error=0
jaccard_b8_serial<1536d>/min_time:10.000/threads:1 1584 ns 1584 ns 8825746 abs_delta=0 bytes=96.074M/s pairs=631.419k/s relative_error=0
kl_f32_serial<1536d>/min_time:10.000/threads:1 21264 ns 21264 ns 657289 abs_delta=nan bytes=270.58M/s pairs=47.0277k/s relative_error=nan
js_f32_serial<1536d>/min_time:10.000/threads:1 33607 ns 33607 ns 417679 abs_delta=nan bytes=59.662M/s pairs=29.7557k/s relative_error=nan
--------------------------------------------------------------------------------------------------------------
Benchmark Native(Haswell) Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
dot_f16_haswell<1536d>/min_time:10.000/threads:1 239 ns 239 ns 57788068 abs_delta=4.49447n bytes=25.7194G/s pairs=4.18609M/s relative_error=838.368n
dot_f32_haswell<1536d>/min_time:10.000/threads:1 229 ns 229 ns 61695066 abs_delta=4.05283n bytes=53.7205G/s pairs=4.37178M/s relative_error=618.113n
dot_f16c_haswell<1536d>/min_time:10.000/threads:1 485 ns 485 ns 29233506 abs_delta=5.26774n bytes=25.3399G/s pairs=2.06217M/s relative_error=1053.19n
dot_f32c_haswell<1536d>/min_time:10.000/threads:1 479 ns 479 ns 29155417 abs_delta=5.35522n bytes=51.3474G/s pairs=2.08933M/s relative_error=914.997n
cos_f16_haswell<1536d>/min_time:10.000/threads:1 248 ns 248 ns 56688499 abs_delta=21.2755n bytes=24.7348G/s pairs=4.02585M/s relative_error=21.336n
cos_f32_haswell<1536d>/min_time:10.000/threads:1 245 ns 245 ns 57090484 abs_delta=4.06642n bytes=50.2399G/s pairs=4.08853M/s relative_error=4.10406n
l2sq_f16_haswell<1536d>/min_time:10.000/threads:1 243 ns 243 ns 58014704 abs_delta=306.799n bytes=25.3349G/s pairs=4.12353M/s relative_error=154.647n
l2sq_f32_haswell<1536d>/min_time:10.000/threads:1 232 ns 232 ns 60353965 abs_delta=110.947n bytes=53.0122G/s pairs=4.31415M/s relative_error=56.0062n
hamming_b8_haswell<1536d>/min_time:10.000/threads:1 103 ns 103 ns 137056450 abs_delta=0 bytes=29.7774G/s pairs=9.69315M/s relative_error=0
jaccard_b8_haswell<1536d>/min_time:10.000/threads:1 141 ns 141 ns 99841876 abs_delta=0 bytes=21.8257G/s pairs=7.10474M/s relative_error=0
dot_f16_serial<1536d>/min_time:10.000/threads:1 6924 ns 6922 ns 2023513 abs_delta=12.4463n bytes=887.546M/s pairs=144.457k/s relative_error=1.70755u
dot_f32_serial<1536d>/min_time:10.000/threads:1 1063 ns 1063 ns 13210677 abs_delta=14.2338n bytes=11.5633G/s pairs=941.021k/s relative_error=2.37246u
dot_f16c_serial<1536d>/min_time:10.000/threads:1 14483 ns 14480 ns 972223 abs_delta=9.16103n bytes=848.606M/s pairs=69.0598k/s relative_error=1045.37n
dot_f32c_serial<1536d>/min_time:10.000/threads:1 2303 ns 2303 ns 6104887 abs_delta=6.96473n bytes=10.6735G/s pairs=434.304k/s relative_error=942.332n
cos_f16_serial<1536d>/min_time:10.000/threads:1 7174 ns 7174 ns 1940213 abs_delta=30.3359n bytes=856.47M/s pairs=139.399k/s relative_error=30.6709n
cos_f32_serial<1536d>/min_time:10.000/threads:1 1110 ns 1110 ns 12660383 abs_delta=25.3824n bytes=11.0713G/s pairs=900.988k/s relative_error=25.6467n
l2sq_f16_serial<1536d>/min_time:10.000/threads:1 7130 ns 7128 ns 1969190 abs_delta=1.23103u bytes=861.91M/s pairs=140.285k/s relative_error=620.233n
l2sq_f32_serial<1536d>/min_time:10.000/threads:1 1068 ns 1068 ns 13112453 abs_delta=876.925n bytes=11.5097G/s pairs=936.663k/s relative_error=441.684n
hamming_b8_serial<1536d>/min_time:10.000/threads:1 733 ns 733 ns 18818066 abs_delta=0 bytes=4.18988G/s pairs=1.36389M/s relative_error=0
jaccard_b8_serial<1536d>/min_time:10.000/threads:1 1175 ns 1175 ns 11951942 abs_delta=0 bytes=2.61374G/s pairs=850.828k/s relative_error=0
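To make the comparison easier to read, here is a small sketch that computes the serial/SIMD speedup for a few rows of the WASM table above and counts the kernels where SIMD is slower. The timings are copied from the table; the struct and function names (`demo_row_t`, `demo_speedup`, `demo_count_regressions`, `demo_print_speedups`) are hypothetical and not part of the benchmark suite.

```c
#include <stddef.h>
#include <stdio.h>

/* Per-call times in nanoseconds, copied from the WASM table. */
typedef struct {
    const char *name;
    double simd_ns;   /* *_neon row   */
    double serial_ns; /* *_serial row */
} demo_row_t;

static const demo_row_t demo_wasm_rows[] = {
    {"dot_f32", 297.0, 1101.0},
    {"cos_f32", 322.0, 1141.0},
    {"l2sq_f32", 296.0, 1120.0},
    {"hamming_b8", 8672.0, 805.0},
    {"jaccard_b8", 17235.0, 1584.0},
};

/* Speedup > 1 means the SIMD kernel is faster than the serial one. */
static double demo_speedup(double serial_ns, double simd_ns) {
    return serial_ns / simd_ns;
}

/* Counts rows where the SIMD kernel is slower than the serial one. */
static size_t demo_count_regressions(const demo_row_t *rows, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (demo_speedup(rows[i].serial_ns, rows[i].simd_ns) < 1.0) ++count;
    return count;
}

/* Prints one line per kernel with its speedup factor. */
static void demo_print_speedups(void) {
    size_t n = sizeof(demo_wasm_rows) / sizeof(demo_wasm_rows[0]);
    for (size_t i = 0; i < n; ++i)
        printf("%-12s %.2fx\n", demo_wasm_rows[i].name,
               demo_speedup(demo_wasm_rows[i].serial_ns, demo_wasm_rows[i].simd_ns));
}
```

Calling `demo_print_speedups()` from `main` lists the factors. Only the two binary-vector kernels come out below 1, which is why they look like the first candidates for hand-written WASM SIMD code instead of relying on the NEON translation.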