[BUG] Missing intrinsics for AArch32 instructions VMLA.F16 and VMLS.F16
Alongside VFMA.F16/VFMS.F16, AArch32 offers the VMLA.F16/VMLS.F16 instructions, which perform a multiply-add operation with intermediate rounding (the product is rounded to half precision before the addition). Importantly, the by-scalar (lane) form (e.g. VMLA.F16 Qd, Qn, Dm[x]) on AArch32 is supported only for the VMLA/VMLS instructions, and not for the VFMA/VFMS instructions.
The NEON intrinsics specification lacks intrinsics for the half-precision VMLA/VMLS instructions. In particular, this makes it impossible to achieve peak performance for half-precision matrix-matrix multiplication on AArch32 using NEON intrinsics, because the optimal implementation would use the VMLA.F16 Qd, Qn, Dm[x] form.
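For illustration, here is a minimal sketch of what one is currently forced to write instead, assuming the existing vfmaq_f16 and vdupq_lane_f16 intrinsics (available under __ARM_FEATURE_FP16_VECTOR_ARITHMETIC): the by-scalar multiply-accumulate has to be split into a lane broadcast plus a full-vector fused multiply-add, costing an extra instruction and an extra register compared to a single VMLA.F16 Qd, Qn, Dm[x].

```c
#include <arm_neon.h>

/* Sketch of the current workaround (not the requested intrinsic): with no
 * by-scalar multiply-accumulate intrinsic, the accumulation step of an f16
 * micro-kernel must broadcast the lane first, e.g.
 *     VDUP.16  q2, d4[0]
 *     VFMA.F16 q0, q1, q2
 * instead of the single instruction
 *     VMLA.F16 q0, q1, d4[0]
 * (and VFMA fuses the rounding, so the results also differ slightly from
 * VMLA's intermediate rounding). */
static inline float16x8_t
accumulate_lane0_today(float16x8_t acc, float16x8_t a, float16x4_t b)
{
    /* acc + a * b[0], computed as a fused multiply-add of a broadcast lane */
    return vfmaq_f16(acc, a, vdupq_lane_f16(b, 0));
}
```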
I request that the NEON intrinsics specification be updated to include the following intrinsics for AArch32 (a sketch of possible prototypes follows the list):
- vmla_f16 (VMLA.F16 Dd, Dn, Dm)
- vmls_f16 (VMLS.F16 Dd, Dn, Dm)
- vmlaq_f16 (VMLA.F16 Qd, Qn, Qm)
- vmlsq_f16 (VMLS.F16 Qd, Qn, Qm)
- vmla_lane_f16 (VMLA.F16 Dd, Dn, Dm[x])
- vmls_lane_f16 (VMLS.F16 Dd, Dn, Dm[x])
- vmlaq_lane_f16/vmlaq_laneq_f16 (VMLA.F16 Qd, Qn, Dm[x])
- vmlsq_lane_f16/vmlsq_laneq_f16 (VMLS.F16 Qd, Qn, Dm[x])
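For concreteness, here is a sketch of possible prototypes for the requested intrinsics, modeled on the naming and argument pattern of the existing single-precision vmla_f32/vmlaq_lane_f32 family. None of these exist in arm_neon.h today; the exact signatures, lane-argument types, and feature-test macros would of course be up to the specification authors.

```c
#include <arm_neon.h>

/* Proposed (hypothetical) prototypes -- NOT existing ACLE API.
 * Semantics: non-fused multiply-accumulate/subtract, i.e. the product is
 * rounded to half precision before the addition/subtraction, matching the
 * behaviour of VMLA.F16/VMLS.F16. */
float16x4_t vmla_f16(float16x4_t a, float16x4_t b, float16x4_t c);                        /* VMLA.F16 Dd, Dn, Dm    */
float16x4_t vmls_f16(float16x4_t a, float16x4_t b, float16x4_t c);                        /* VMLS.F16 Dd, Dn, Dm    */
float16x8_t vmlaq_f16(float16x8_t a, float16x8_t b, float16x8_t c);                       /* VMLA.F16 Qd, Qn, Qm    */
float16x8_t vmlsq_f16(float16x8_t a, float16x8_t b, float16x8_t c);                       /* VMLS.F16 Qd, Qn, Qm    */
float16x4_t vmla_lane_f16(float16x4_t a, float16x4_t b, float16x4_t v, const int lane);   /* VMLA.F16 Dd, Dn, Dm[x] */
float16x4_t vmls_lane_f16(float16x4_t a, float16x4_t b, float16x4_t v, const int lane);   /* VMLS.F16 Dd, Dn, Dm[x] */
float16x8_t vmlaq_lane_f16(float16x8_t a, float16x8_t b, float16x4_t v, const int lane);  /* VMLA.F16 Qd, Qn, Dm[x] */
float16x8_t vmlaq_laneq_f16(float16x8_t a, float16x8_t b, float16x8_t v, const int lane); /* VMLA.F16 Qd, Qn, Dm[x] */
float16x8_t vmlsq_lane_f16(float16x8_t a, float16x8_t b, float16x4_t v, const int lane);  /* VMLS.F16 Qd, Qn, Dm[x] */
float16x8_t vmlsq_laneq_f16(float16x8_t a, float16x8_t b, float16x8_t v, const int lane); /* VMLS.F16 Qd, Qn, Dm[x] */
```

With the lane forms available, one step of an illustrative 8x4 half-precision GEMM micro-kernel could reuse a single loaded D register of B values across four accumulators without any extra VDUP instructions (this usage sketch also assumes the float16_t element type and vld1_f16/vld1q_f16 loads from arm_neon.h):

```c
/* Usage sketch (hypothetical -- depends on the proposed intrinsics above). */
void hgemm_8x4_step(float16x8_t acc[4], const float16_t *a_ptr, const float16_t *b_ptr)
{
    float16x8_t a = vld1q_f16(a_ptr); /* 8 consecutive elements of a column of A */
    float16x4_t b = vld1_f16(b_ptr);  /* 4 consecutive elements of a row of B    */
    acc[0] = vmlaq_lane_f16(acc[0], a, b, 0); /* e.g. VMLA.F16 q8,  qA, dB[0] */
    acc[1] = vmlaq_lane_f16(acc[1], a, b, 1); /* e.g. VMLA.F16 q9,  qA, dB[1] */
    acc[2] = vmlaq_lane_f16(acc[2], a, b, 2); /* e.g. VMLA.F16 q10, qA, dB[2] */
    acc[3] = vmlaq_lane_f16(acc[3], a, b, 3); /* e.g. VMLA.F16 q11, qA, dB[3] */
}
```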
Hi @Maratyszcza, thanks for your issue report, and apologies for the late response.
If possible, we encourage you to contribute a Pull Request that addresses this issue. We will be happy to review it.