X^T*B doesn't call BLAS gemm
For matrix $A$, $B$, calculating $A^TB$ should be able to use gemm routine, however, it doesn't
let a = Array2::<f64>::zeros((10000, 1000));
let b = Array2::<f64>::zeros((10000, 1000));
let _ = a.t().dot(&b);
However, if $A^T$ is provided directly, it will call gemm
let b = Array2::<f64>::zeros((10000, 1000));
let at = Array2::<f64>::zeros((1000, 10000));
let _ = at.dot(&b);
See code example here
https://github.com/fardream/rust-ndarray-t-dot
Related to #445
This code here needs to change https://github.com/rust-ndarray/ndarray/blob/07406955868dd98985d7e2f1de1f643be4d8888f/src/linalg/impl_linalg.rs#L393-L417
It needs to be rewritten to be more general. It has a comment there that I guess explains why it doesn't cover this case right now.
ndarray arrays can have more general strides than blas can handle, so there will always be arrays that can't be passed to blas, so ndarray needs to examine the arguments and figure out if and how the arrays can be used with blas.
In your example The ATB product comes in with the operands not square and the first operand in column major layout and the second in row major layout. The impl just needs to figure that out and how to call blas with it. I wonder if the check for square dimensions can be removed, not sure why it's there.
Thanks for the good test case. I've verified that it's fixed in your test case. BLAS does better than the matrixmultiply fallback, but the matrixmultiply fallback can do well if feature matrixmultiply-threading is enabled.