Ryo Suzuki
I see your point; we did find that de-interleaving the complex numbers first was faster for Highway on NEON and SVE. I'm not sure about the x86 side of things...
It's good to know that svcadd is already being used in highway! I think we're still missing a direct link to the svcmla instructions. Even when there are equivalent ways...
The use case we've tested is a vector multiplication, so the order needs to be preserved. You raise a good point on the difference between NEON and SVE, we...
> So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?

Yes, that is correct. I'm still trying to find out why...
So it looks like the doubling is just a side effect of the way the multiplication is done. The multiplication is done in Q0.15 format which is saturated to Q0.31...
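To make the doubling concrete, here is a minimal scalar sketch of one lane of a saturating doubling multiply-accumulate long, as I understand the behaviour of `vqdmlal_s16` from the Arm documentation. The helper names (`SaturateToI32`, `QDMulLong`, `QDMlaLong`) are hypothetical and are not part of Highway or the NEON/SVE intrinsics. The product of two Q0.15 values has 30 fractional bits, so doubling it yields Q0.31, and the only product that overflows is -1.0 * -1.0.

```cpp
#include <cstdint>

// Hypothetical helper: clamp a 64-bit intermediate to the int32_t range.
constexpr int32_t SaturateToI32(int64_t v) {
  return v > INT32_MAX ? INT32_MAX
                       : (v < INT32_MIN ? INT32_MIN : static_cast<int32_t>(v));
}

// Hypothetical scalar model of a saturating doubling multiply long:
// Q0.15 * Q0.15 gives Q0.30; the factor of 2 realigns the result to Q0.31.
constexpr int32_t QDMulLong(int16_t a, int16_t b) {
  return SaturateToI32(2 * static_cast<int64_t>(a) * static_cast<int64_t>(b));
}

// Hypothetical scalar model of the accumulate step: the doubled product is
// saturated first, then added to the accumulator with saturation again.
constexpr int32_t QDMlaLong(int32_t acc, int16_t a, int16_t b) {
  return SaturateToI32(static_cast<int64_t>(acc) + QDMulLong(a, b));
}
```

For example, 0.5 in Q0.15 is 16384, and `QDMulLong(16384, 16384)` gives 536870912, which is 0.25 in Q0.31. Without the doubling, the result would be 0.25 in Q0.30 and would need a separate shift.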
@jan-wassenberg Yes, that is perfect! I've added a comment on the PR requesting support for 8-bit and 16-bit elements as well.
Yes, the use case is running on V1, with scalable vectors used in some parts of the code and fixed-size vectors used in other parts...
I think that would be the ideal solution but for now, how would one specify whether to use NEON or SVE on a per-function basis? I don't envision using NEON...
> Thank you @Ryo-not-rio ! Based upon who your employer is, I presume your test cases were verified on real hardware, not QEMU?

Yes, they have been tested on real...
Looks like the high_n unit tests are incorrect; looking into it.