CMSIS-DSP MVE arm_scale_q15 does not operate as specified

The MVE implementation of arm_scale_q15 does not use a 2.30 intermediate, as per the documentation and other implementations, but truncates the multiplication immediately to 2.14. This reduces the output range and produces irregular steps when scaling up significantly, for example, in extremis:

Given scaleFract = 0x6666, shift = 13

input	2.30 intermediate	non-MVE result	2.14 intermediate	MVE result
0x0000	0x00000000	0x0000	0x0000	0x0000
0x0001	0x00006666	0x1999	0x0000	0z0000
0x0002	0x0000CCCC	0x3333	0x0000	0x0000
0x0003	0x00013332	0x4CCC	0x0001	0x4000
0x0004	0x00019998	0x6666	0x0001	0x4000
0x0005	0x0001FFFE	0x7FFF	0x0001	0x4000

I've not checked how far this extends to other MVE variants - this was the function I was using and noticed.

Oct 19 '22 05:10 kjbracey

@kjbracey Thank for reporting this. It is a bug.

MVE versions are not strictly identical to the scalar versions and may give a slightly different result. It was required (sometimes) for vectorizing more easily.

But the difference should be minimal. Here, it is clearly not a small difference.

Oct 20 '22 06:10 christophe0606

@kjbracey It looks like the scalar versions are also not consistent. q31 is first doing a >> 32 before applying the shift.

So it is not just a consistency issue with MVE versions.

Oct 20 '22 07:10 christophe0606

So it does. Yet the high-precision intermediate is very clearly specified, and would seem to be a key feature of this specific function.

Although other characteristics might be suitable for most users - eg I replaced my call to q15 with one using a float16 intermediate because always getting a consistent 11 bits of precision was fine.

Maybe there could be an API fork between the "high precision, no loss near unity gain" expensive version as currently specified and a "pretty good precision" performance version.

Oct 20 '22 07:10 kjbracey

I have changed the MVE implementations for arm_scale_q15 and arm_scale_q7. Now they behave like the scalar version but they take more cycles.

For the q31 (both scalar and MVE) : nothing has changed but I modified the documentation to make to make it clear it is doing a shift by 32 before applying the shift argument so there is a loss of accuracy.

Oct 24 '22 05:10 christophe0606

Seems reasonable to me. I haven't tested it yet though. Thanks!

Oct 25 '22 11:10 kjbracey