Peter Caday
Omnibus pulldown from upstream gemmstone repository. Highlights include:
* Long-overdue refactoring of register layouts into their own class, `RegisterLayout`
* Numerous copy planner optimizations/fixes:
  - New dedicated upconversions to bf16...
Adds f16-accumulation FMA strategies for MTL (opt-in with --attr-acc-mode=f16). The theoretical peak throughput is 2x that of f32 accumulation, and the measured speedup is similar.
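A rough illustration of the precision trade-off that f16 accumulation carries in general, which is a plausible (but here assumed, not stated) reason for making it opt-in. This is a pure-Python sketch, not oneDNN or gemmstone code: `to_f16`, `dot_f16_acc`, and `dot_wide_acc` are hypothetical helper names, and a Python double stands in for the wider f32 accumulator.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_f16_acc(a, b):
    """Dot product of f16 inputs with the accumulator rounded to f16 each step."""
    acc = 0.0
    for x, y in zip(a, b):
        # Product and sum are formed in double, then rounded once to f16,
        # which models a fused multiply-add with an f16 accumulator.
        acc = to_f16(acc + to_f16(x) * to_f16(y))
    return acc


def dot_wide_acc(a, b):
    """Same f16 inputs, but accumulated in a wider type (double as a stand-in for f32)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_f16(x) * to_f16(y)
    return acc


ones = [1.0] * 4096
print(dot_f16_acc(ones, ones))   # stalls once the f16 spacing exceeds 1
print(dot_wide_acc(ones, ones))  # wider accumulator keeps counting
```

With 4096 ones, the f16 accumulator stops growing at 2048 because 2049 is not representable in binary16 and rounds back to 2048, while the wider accumulator reaches 4096. Short reductions with small partial sums are much less affected.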
Backport of #3357 to `rls-v3.9-pc`.
Addresses MFDNN-13752. Some of the new strategies from #2788 run out of registers. This PR reduces the m tile size, which avoids the problem and also seems to improve performance.
Backport of #3357 to `rls-v3.8`.
POC of nf4 weights decompression for Intel GPUs (MFDNN-13636), to allow OpenVINO to test it out. Adds a new nf4 data type (may not be final design -- just for...