blis Armv8-A Row-major Kernel Improvements

Status: This is an 8x6 row-major kernel for ARMv8-A, so its internal structure is basically the same as that of the current 6x8 column-preferring one.

Updates

Instead of clearing the C-microtile registers at the beginning of the assembly, execute the first k-loop iteration using fmul instead of fmla. The code path within the assembly is arranged so that this introduces (basically) no additional branching cost.
Scatter the prefetching code for C throughout the microkernel loops.
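
To make the first update concrete, here is a minimal sketch in plain C rather than the actual assembly. The function names, the packed layouts and the 8x6 orientation (MR = 8, NR = 6) are assumptions made for illustration and are not taken from the PR's code. It shows that writing the first rank-1 update with a plain multiply yields the same microtile as zeroing the accumulators and then accumulating every iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MR = 8, NR = 6 };  /* assumed tile shape, based on the "8x6" in the description */

/* Hypothetical reference: zero the accumulators, then accumulate all k updates.
 * a:  packed, k slivers of MR elements each
 * b:  packed, k slivers of NR elements each
 * ab: MR x NR row-major tile, overwritten */
void ukr_zero_then_fma(size_t k, const double *a, const double *b, double *ab)
{
    assert(a != NULL && b != NULL && ab != NULL);
    memset(ab, 0, sizeof(double) * MR * NR);       /* the clearing pass */
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i * NR + j] += a[p * MR + i] * b[p * NR + j];  /* fmla-like */
}

/* Hypothetical variant: the first iteration overwrites, so no clearing pass. */
void ukr_first_iter_mul(size_t k, const double *a, const double *b, double *ab)
{
    assert(a != NULL && b != NULL && ab != NULL);
    if (k == 0) {                                  /* empty product: zero tile */
        memset(ab, 0, sizeof(double) * MR * NR);
        return;
    }
    /* p = 0: plain multiply (fmul-like) initializes every accumulator. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            ab[i * NR + j] = a[i] * b[j];
    /* p >= 1: multiply-accumulate (fmla-like). */
    for (size_t p = 1; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i * NR + j] += a[p * MR + i] * b[p * NR + j];
}
```

Note that the k = 0 case needs separate handling in this sketch, because with no first iteration nothing would initialize the tile. How the real kernel treats that case is not stated in this thread.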

Restrictions: This kernel assumes hardware prefetching for the packed A/B blocks (so as not to burden the pipeline with additional instructions or the DMA with additional loads). Older chips like ThunderX2 may not perform well with it, since they may have no hardware prefetching at all, while newer ones like Amazon's C6g tend to be happier with it.

This update also contains some prerequisite changes for my gemmsup+packm work here, which I'd also like to merge later as a BLIS sandbox.

Dec 16 '22 17:12 xrq-phys

Thanks @xrq-phys! I've asked Jeff to take a look at the new kernel for feedback. I think he and his application could stand to benefit from this, given the inherent advantage row-preferring kernels have with left-sided trsm (which is the only trsm code path that BLIS implements).

Happy holidays! 🎄 🎁 🍾

Dec 26 '22 00:12 fgvanzee

Hi there, I know this is a bit old, but I came across this change through this paper.

I was just wondering what the status was for having this (and other changes) merged upstream and/or if there was a plan to do so?

Oct 03 '23 10:10 GodTamIt

Hey @GodTamIt, thanks for your inquiry. I guess we're still waiting on @jdiamondGitHub to look over this PR. I'll reach out to him separately as well.

Oct 03 '23 18:10 fgvanzee