[BUG] Missing intrinsics for AArch32 instructions VMLA.F16 and VMLS.F16
Alongside VFMA.F16/VFMS.F16, AArch32 offers the VMLA.F16/VMLS.F16 instructions, which perform a multiply-add operation with intermediate rounding (the product is rounded to half precision before the addition). Importantly, the by-scalar (lane) form (e.g. VMLA.F16 Qd, Qn, Dm[x]) on AArch32 is supported only for the VMLA/VMLS instructions, and not for the VFMA/VFMS instructions.
The NEON intrinsics specification lacks intrinsics for the half-precision VMLA/VMLS instructions. In particular, this makes it impossible to achieve peak performance for half-precision matrix-matrix multiplication on AArch32 using NEON intrinsics, because the optimal implementation would use the VMLA.F16 Qd, Qn, Dm[x] form.
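For illustration, here is a minimal sketch of what one is currently forced to write instead, assuming the existing vfmaq_f16 and vdupq_lane_f16 intrinsics (available under __ARM_FEATURE_FP16_VECTOR_ARITHMETIC): the by-scalar multiply-accumulate has to be split into a lane broadcast plus a full-vector fused multiply-add, costing an extra instruction and an extra register compared to a single VMLA.F16 Qd, Qn, Dm[x].

```c
#include <arm_neon.h>

/* Sketch of the current workaround (not the requested intrinsic): with no
 * by-scalar multiply-accumulate intrinsic, the accumulation step of an f16
 * micro-kernel must broadcast the lane first, e.g.
 *     VDUP.16  q2, d4[0]
 *     VFMA.F16 q0, q1, q2
 * instead of the single instruction
 *     VMLA.F16 q0, q1, d4[0]
 * (and VFMA fuses the rounding, so the results also differ slightly from
 * VMLA's intermediate rounding). */
static inline float16x8_t
accumulate_lane0_today(float16x8_t acc, float16x8_t a, float16x4_t b)
{
    /* acc + a * b[0], computed as a fused multiply-add of a broadcast lane */
    return vfmaq_f16(acc, a, vdupq_lane_f16(b, 0));
}
```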
I request that the NEON intrinsics specification be updated to include the following intrinsics for AArch32 (a sketch of possible prototypes follows the list):
- vmla_f16 (VMLA.F16 Dd, Dn, Dm)
- vmls_f16 (VMLS.F16 Dd, Dn, Dm)
- vmlaq_f16 (VMLA.F16 Qd, Qn, Qm)
- vmlsq_f16 (VMLS.F16 Qd, Qn, Qm)
- vmla_lane_f16 (VMLA.F16 Dd, Dn, Dm[x])
- vmls_lane_f16 (VMLS.F16 Dd, Dn, Dm[x])
- vmlaq_lane_f16/vmlaq_laneq_f16 (VMLA.F16 Qd, Qn, Dm[x])
- vmlsq_lane_f16/vmlsq_laneq_f16 (VMLS.F16 Qd, Qn, Dm[x])
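For concreteness, here is a sketch of possible prototypes for the requested intrinsics, modeled on the naming and argument pattern of the existing single-precision vmla_f32/vmlaq_lane_f32 family. None of these exist in arm_neon.h today; the exact signatures, lane-argument types, and feature-test macros would of course be up to the specification authors.

```c
#include <arm_neon.h>

/* Proposed (hypothetical) prototypes -- NOT existing ACLE API.
 * Semantics: non-fused multiply-accumulate/subtract, i.e. the product is
 * rounded to half precision before the addition/subtraction, matching the
 * behaviour of VMLA.F16/VMLS.F16. */
float16x4_t vmla_f16(float16x4_t a, float16x4_t b, float16x4_t c);                        /* VMLA.F16 Dd, Dn, Dm    */
float16x4_t vmls_f16(float16x4_t a, float16x4_t b, float16x4_t c);                        /* VMLS.F16 Dd, Dn, Dm    */
float16x8_t vmlaq_f16(float16x8_t a, float16x8_t b, float16x8_t c);                       /* VMLA.F16 Qd, Qn, Qm    */
float16x8_t vmlsq_f16(float16x8_t a, float16x8_t b, float16x8_t c);                       /* VMLS.F16 Qd, Qn, Qm    */
float16x4_t vmla_lane_f16(float16x4_t a, float16x4_t b, float16x4_t v, const int lane);   /* VMLA.F16 Dd, Dn, Dm[x] */
float16x4_t vmls_lane_f16(float16x4_t a, float16x4_t b, float16x4_t v, const int lane);   /* VMLS.F16 Dd, Dn, Dm[x] */
float16x8_t vmlaq_lane_f16(float16x8_t a, float16x8_t b, float16x4_t v, const int lane);  /* VMLA.F16 Qd, Qn, Dm[x] */
float16x8_t vmlaq_laneq_f16(float16x8_t a, float16x8_t b, float16x8_t v, const int lane); /* VMLA.F16 Qd, Qn, Dm[x] */
float16x8_t vmlsq_lane_f16(float16x8_t a, float16x8_t b, float16x4_t v, const int lane);  /* VMLS.F16 Qd, Qn, Dm[x] */
float16x8_t vmlsq_laneq_f16(float16x8_t a, float16x8_t b, float16x8_t v, const int lane); /* VMLS.F16 Qd, Qn, Dm[x] */
```

With the lane forms available, one step of an illustrative 8x4 half-precision GEMM micro-kernel could reuse a single loaded D register of B values across four accumulators without any extra VDUP instructions (this usage sketch also assumes the float16_t element type and vld1_f16/vld1q_f16 loads from arm_neon.h):

```c
/* Usage sketch (hypothetical -- depends on the proposed intrinsics above). */
void hgemm_8x4_step(float16x8_t acc[4], const float16_t *a_ptr, const float16_t *b_ptr)
{
    float16x8_t a = vld1q_f16(a_ptr); /* 8 consecutive elements of a column of A */
    float16x4_t b = vld1_f16(b_ptr);  /* 4 consecutive elements of a row of B    */
    acc[0] = vmlaq_lane_f16(acc[0], a, b, 0); /* e.g. VMLA.F16 q8,  qA, dB[0] */
    acc[1] = vmlaq_lane_f16(acc[1], a, b, 1); /* e.g. VMLA.F16 q9,  qA, dB[1] */
    acc[2] = vmlaq_lane_f16(acc[2], a, b, 2); /* e.g. VMLA.F16 q10, qA, dB[2] */
    acc[3] = vmlaq_lane_f16(acc[3], a, b, 3); /* e.g. VMLA.F16 q11, qA, dB[3] */
}
```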
Hi @Maratyszcza, thanks for your issue report, and apologies for the late response.
If possible, we encourage you to contribute a Pull Request that addresses this issue. We will be happy to review it.