Einsum support: From external crate
We'd like functionality like einsum; that would be awesome. Einsum is such a big feature that the preferred approach would be for a separate crate to implement it.
(This information has been mentioned before in the issues, but I'm collecting it here so that it's easy to find).
See also https://github.com/oracleofnj/einsum
See also issue #16
It sounds exciting and I would like to help. Are you planning to create a new einsum crate in rust-ndarray or use the existing one?
I am not actively maintaining the crate I published anymore so please feel free to fork my repo or otherwise take whatever code you want from it. I'm happy to answer any questions you may have on my implementation.
Incidentally, something has changed between ndarray v0.14 and v0.15 which makes my crate not work with 0.15: I get an assertion error from `matrixmultiply-0.3.0/src/gemm.rs:293:13` when I try to run the tests with ndarray v0.15.
@oracleofnj Hi! Very cool project. How far did you get with einsum?
If you're hitting that assertion, that's exactly the bug https://github.com/bluss/matrixmultiply/issues/55, I think. Are you on macOS? Can you reproduce it? It's probably a bug, just one I can't reproduce; potentially a macOS/Rust bug, if we're not doing something wrong.
The quickest workarounds I have are running the tests in release mode or enabling optimizations for matrixmultiply.
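For the second workaround, a minimal sketch of what that can look like in the Cargo.toml of the crate whose tests fail, using Cargo's standard profile-override syntax (the exact `opt-level` value here is my assumption; the point is just to build this one dependency with optimizations in dev/test builds):

```toml
# Optimize only the matrixmultiply dependency in dev/test builds,
# leaving the rest of the build unoptimized for fast compiles.
[profile.dev.package.matrixmultiply]
opt-level = 3
```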
Edit: There's a branch with a fix, and feedback on it would be valuable. https://github.com/bluss/matrixmultiply/pull/56 :slightly_smiling_face:
@bluss Yes, I am on macOS and I can reproduce the bug in the sense that every time I run `cargo test` on my crate, a number of tests fail; however, it's not always the same tests that fail each time. If I run the tests in release mode, they all pass.
As for how far I got, I would say pretty far! As far as I know, it works and is generally compatible with numpy. Also, at the time, the performance penalty was pretty small according to the benchmarks I ran: basically a fixed overhead cost for parsing the expression, but it was "doing the right thing" once it had converted the string to the underlying operations.
The main performance issue that I never got around to tackling was optimizing complicated expressions, i.e. finding the optimal order in which to perform the matrix multiplications when multiple matrices are involved in the operation.
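To illustrate why that ordering matters, here is a small self-contained sketch (not from the crate; the shapes and function name are made up for illustration) comparing the multiply-add counts of the two ways to parenthesize a three-matrix product $ABC$:

```rust
/// Rough multiply-add count for a single (n x m) * (m x k) matrix product.
fn matmul_cost(n: usize, m: usize, k: usize) -> usize {
    n * m * k
}

fn main() {
    // Made-up shapes: A is 10x1000, B is 1000x10, C is 10x1000.
    let (n, m, k, p) = (10, 1000, 10, 1000);

    // (AB)C: a 10x1000 * 1000x10 product, then 10x10 * 10x1000.
    let left_first = matmul_cost(n, m, k) + matmul_cost(n, k, p);

    // A(BC): a 1000x10 * 10x1000 product, then 10x1000 * 1000x1000.
    let right_first = matmul_cost(m, k, p) + matmul_cost(n, m, p);

    println!("(AB)C: {left_first} multiply-adds");  // 200_000
    println!("A(BC): {right_first} multiply-adds"); // 20_000_000
}
```

Same inputs, same result, a factor of 100 difference in work.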
Wow, that's great. It should totally be released as something more prominent that people can use. Thanks for confirming the bug; of course it has higher priority now that it's more than one mysterious report.
The optimization order thing I mentioned is alluded to here; I have a placeholder in the crate to support a more intelligent algorithm.
I'd be happy to see if that branch fixes the bug for me but I'm not sure how to try it since I'm just importing ndarray as a dependency.
In your Cargo.toml you add this:

```toml
[patch.crates-io]
matrixmultiply = { git = "https://github.com/bluss/matrixmultiply", branch = "align-manually" }
```
I used https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html as a reference.
That fixes it for me.
I just pushed v0.7.0 (which depends on ndarray v0.15) to crates.io.
I am also excited by the idea of having einsum functionality to easily perform complicated matrix operations. However, in the case of Rust, I was wondering whether it wouldn't be nicer to hand the optimization of the contraction path off to the compiler, to avoid any overhead at runtime. I'm aware that correctly calculating the cost of each path depends on the sizes of the array axes, but if I understand it correctly, it is the sizes relative to each other that matter here, not the absolute sizes. So in the case of axes with very different lengths, these ratios could also be passed to a macro. I'm not sure if I'm missing something else here, so I'd be interested to know if there's anything in your view that speaks against the idea of an einsum macro.
@hochej I don't think you're missing anything; in fact I made a "Convert to macro" issue oracleofnj/einsum#2 in my repo that I never got around to doing. I agree that having a macro implementation would probably be superior for multiple reasons, particularly if the dimensions are known at compile time.
I tried creating a proc-macro-based implementation of einsum: https://github.com/termoshtt/einsum-derive
It works like this:

```rust
use ndarray::array;
use einsum_derive::einsum;

let a = array![
    [1.0, 2.0],
    [3.0, 4.0]
];
let b = array![
    [1.0, 2.0],
    [3.0, 4.0]
];
// "ij,jk->ik" is an ordinary matrix product.
let c = einsum!("ij,jk->ik", a, b);
assert_eq!(c, array![
    [7.0, 10.0],
    [15.0, 22.0]
]);
```
Roughly, the proc-macro `einsum!("ij,jk->ik", a, b)` generates a function named `ij_jk__ik` which takes two `Array2`s and returns an `Array2`:
```rust
fn ij_jk__ik<T, S0, S1>(
    arg0: ndarray::ArrayBase<S0, ndarray::Ix2>,
    arg1: ndarray::ArrayBase<S1, ndarray::Ix2>,
) -> ndarray::Array<T, ndarray::Ix2>
where
    T: ndarray::LinalgScalar,
    S0: ndarray::Data<Elem = T>,
    S1: ndarray::Data<Elem = T>,
{ ... }
```
and then calls `ij_jk__ik(a, b)`.
This also supports einsum factorization: e.g. the three-matrix multiplication `ij,jk,kl->il` will be factorized into two successive matrix multiplications, `ij,jk->ik` and `ik,kl->il`, and thus computed in $O(N^3)$ time if all sizes are $N$. This factorization is done by brute force within the proc-macro. For example, the three-matrices-and-vector multiplication $ABCv$, i.e. `ij,jk,kl,l->i`, has several possible factorizations: two-step ones like $A(BCv)$, $(AB)Cv$, $A(BC)v$, ..., and three-step ones like $A(B(Cv))$, $((AB)C)v$, and so on. `einsum!` generates every possible factorization and evaluates the computation cost assuming every dimension has the same size and ignoring strides; $A(B(Cv))$ is selected because only this factorization has $O(N^2)$ cost. This can be determined from the subscripts alone; it does not depend on the shapes or strides of the input tensors.
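To make that cost comparison concrete (a worked example under the same assumption that every dimension has size $N$): every step of $A(B(Cv))$ is a matrix-vector product, while the other orderings form at least one matrix-matrix product first:

$$
\underbrace{Cv}_{O(N^2)} \to \underbrace{B(Cv)}_{O(N^2)} \to \underbrace{A(B(Cv))}_{O(N^2)}
\qquad\text{vs.}\qquad
\underbrace{AB}_{O(N^3)} \to \underbrace{(AB)C}_{O(N^3)} \to \underbrace{((AB)C)v}_{O(N^2)}
$$

So the left path totals $O(N^2)$ while the right path is dominated by its $O(N^3)$ steps.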
As discussed in this issue, we may find better factorizations if we search for the optimal path using the shapes and strides of the input tensors, but that is impossible for a proc-macro-based approach, since those are not known at macro-expansion time.
Currently (on 0.1.0) this does not support generating calls to BLAS routines; it simply generates naive for-loops.
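For intuition, here is a hand-written sketch of the kind of naive triple loop such a generated body amounts to for `ij,jk->ik` (illustrative only; this is not the actual code the macro emits, and the function name is made up):

```rust
use ndarray::{Array2, ArrayBase, Data, Ix2, LinalgScalar};

/// Naive contraction for "ij,jk->ik": a plain triple loop
/// summing over the shared index j. No BLAS involved.
fn naive_ij_jk__ik<T, S0, S1>(
    arg0: &ArrayBase<S0, Ix2>,
    arg1: &ArrayBase<S1, Ix2>,
) -> Array2<T>
where
    T: LinalgScalar,
    S0: Data<Elem = T>,
    S1: Data<Elem = T>,
{
    let (n_i, n_j) = arg0.dim();
    let (n_j2, n_k) = arg1.dim();
    assert_eq!(n_j, n_j2, "contracted axis j must match");

    let mut out = Array2::<T>::zeros((n_i, n_k));
    for i in 0..n_i {
        for k in 0..n_k {
            let mut acc = T::zero();
            for j in 0..n_j {
                acc = acc + arg0[(i, j)] * arg1[(j, k)];
            }
            out[(i, k)] = acc;
        }
    }
    out
}
```

For this particular subscript the loop computes the same thing as `arg0.dot(arg1)`, just without dispatching to matrixmultiply/BLAS.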