Armv8-A Row-major Kernel Improvements

Open xrq-phys opened this issue 3 years ago • 3 comments

Status This is an 8x6 row-major kernel for ARMv8-A, so its internal structure is basically the same as the current 6x8 column-preferring one.

Updates

  • Instead of clearing the C-microtile registers at the beginning of the assembly, execute the first k-loop using fmul instead of fmla. The code path within the assembly is arranged so that this (basically) introduces no additional branching cost.
  • Scatter prefetching code for C into microkernel loops.
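For readers who want the two updates in a more concrete form, here is a minimal scalar C sketch, not the actual assembly from this PR. It reads "first k-loop" as the first k iteration, assumes packed A/B panels of MR = 8 and NR = 6 doubles per k step, and uses the GCC/Clang `__builtin_prefetch` hint as a stand-in for the prefetch instructions. The names `ukr_zero_init` and `ukr_fmul_first` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Micro-tile shape taken from the PR description (8x6, row-major C). */
enum { MR = 8, NR = 6 };

/* Baseline: zero the accumulators, then accumulate in every k iteration. */
static void ukr_zero_init(size_t k, const double *a, const double *b,
                          double *c, size_t ldc)
{
    double ab[MR][NR];
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            ab[i][j] = 0.0;                      /* explicit clearing step */
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i][j] += a[p * MR + i] * b[p * NR + j];
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += ab[i][j];
}

/* Sketch of the updated scheme: the first k iteration initializes the
 * accumulators with a plain multiply (the role fmul plays), and the
 * remaining iterations accumulate (the role fmla plays). Prefetches of C
 * rows are spread over the loop instead of being issued all at once. */
static void ukr_fmul_first(size_t k, const double *a, const double *b,
                           double *c, size_t ldc)
{
    double ab[MR][NR];
    if (k == 0)
        return;                                  /* nothing to add to C */

    for (size_t i = 0; i < MR; ++i)              /* first iteration: multiply only */
        for (size_t j = 0; j < NR; ++j)
            ab[i][j] = a[i] * b[j];

    for (size_t p = 1; p < k; ++p) {
        /* Assumption: one C row per iteration until all MR rows are hinted.
         * A prefetch is only a hint, so skipping rows for small k is harmless. */
        if (p <= MR)
            __builtin_prefetch(c + (p - 1) * ldc, 1, 3);
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i][j] += a[p * MR + i] * b[p * NR + j];
    }

    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += ab[i][j];
}
```

Both functions compute C += A·B on one micro-tile and should agree on the result; the second simply drops the zeroing pass. The `if (p <= MR)` test is a shortcut for this sketch, whereas the description above says the assembly arranges its code path so that no extra branching cost is introduced.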

Restrictions This kernel assumes hardware prefetching for packed A/B blocks (so as not to bother the pipeline with additional instructions or the DMA with additional loads). Older chips like ThunderX2 may not perform well with it since they may have no hardware prefetching at all, while newer ones like Amazon's C6g tend to be happier with it.

This update also contains changes that are somewhat of a prerequisite for my gemmsup+packm work here, which I'd also like to merge later as a BLIS sandbox.

xrq-phys avatar Dec 16 '22 17:12 xrq-phys

Thanks @xrq-phys! I've asked Jeff to take a look at the new kernel for feedback. I think he and his application could stand to benefit from this, given the inherent advantage row-preferring kernels have with left-sided trsm (which is the only trsm code path that BLIS implements).

Happy holidays! 🎄 🎁 🍾

fgvanzee avatar Dec 26 '22 00:12 fgvanzee

Hi there, I know this is a bit old, but I came across this change from this paper.

I was just wondering what the status was for having this (and other changes) merged upstream and/or if there was a plan to do so?

GodTamIt avatar Oct 03 '23 10:10 GodTamIt

Hey @GodTamIt, thanks for your inquiry. I guess we're still waiting on @jdiamondGitHub to look over this PR. I'll reach out to him separately as well.

fgvanzee avatar Oct 03 '23 18:10 fgvanzee