
Explore repeated terms in a function

Open odow opened this issue 3 years ago • 8 comments

A lot of problems have the form $\sum_i f(x_i)$, for example, machine learning problems where the sum is over observations in a dataset. We can improve performance if we identify the common term $f$.
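As a sketch of the idea (hypothetical `f` and toy data, not this package's API): if every term of the sum shares one template, a single derivative expression suffices, and the gradient is just that template mapped across the data.

```julia
# Toy illustration of the repeated-term idea (not MathOptSymbolicAD's API).
# Every term of sum_i f(a_i, x_i) shares one template, so one derivative
# expression is derived once and broadcast over the data.
f(a, x) = log(a + x)
df(a, x) = 1 / (a + x)     # the single template derivative d/dx log(a + x)

a = [2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0]
g = df.(a, x)              # gradient of sum_i f(a_i, x_i) with respect to x
```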

This structure appears in the inverse ising problems [private link]: https://github.com/lanl-ansi/nlp-jump-examples/blob/47d6ccb26c6041390b0662b1d8c1041f28d787ed/inverse_ising/inverse_ising-logrise.jl#L37-L40

odow avatar Sep 12 '22 22:09 odow

This also appears in @michel2323's GO models. At the very least, we should be able to handle separable objective functions, even ones where only a subset of terms share a similar structure.

odow avatar Feb 08 '23 20:02 odow

See also https://github.com/JuliaSmoothOptimizers/ExpressionTreeForge.jl

cc @paraynaud

odow avatar Feb 14 '24 03:02 odow

The experimental MadDiff does this well, IIRC. Since they type everything, though, compilation time absolutely blows up if the tool fails to spot such a repeated pattern.

baggepinnen avatar Feb 14 '24 05:02 baggepinnen

Hello! I am uncertain about what you are trying to calculate. If you are interested in partially-separable functions, the $x_i$ (from the original post) are the variables parametrizing the element function $f_i$. To fulfill the requirements of partial separability, $x_i$ must be a subset of the decision variables. Note that in machine learning, the loss applied to a minibatch sums (or averages) the loss applied to labelled observations. If the loss function has no partially-separable structure, as with the negative log-likelihood, each term of the sum involves all of the variables/weights parametrizing the neural network. In that case, there is no partial separability.

A proper example of partial separability is $\|A x - b\|^2$ where $A$ is a sparse matrix, i.e., each row of $A$ has at least one null component, so each element function depends on only a subset of the variables. Some loss functions are partially separable, but exploiting this generally requires reworking the neural network architecture to reduce the subset of variables/weights parametrizing each element function. I wrote a paper with more details: *Partially-separable loss to parallelize partitioned neural network training* (a GERAD preprint).
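To make that concrete, here is a toy numerical check (my own example, not from the preprint) that the row-wise decomposition of $\|Ax - b\|^2$ only needs the nonzero pattern of each row:

```julia
using SparseArrays, LinearAlgebra

# f(x) = ||A x - b||^2 = sum_i (A[i, :]' x - b[i])^2: each element function
# depends only on the columns where row i of A is nonzero.
A = sparse([1, 1, 2, 2], [1, 2, 2, 3], [1.0, 2.0, 3.0, 4.0], 2, 3)
b = [1.0, 2.0]
x = [1.0, 1.0, 1.0]

# the variable subset used by each element function (the role of U_i):
subsets = [findall(!iszero, A[i, :]) for i in 1:size(A, 1)]

full        = norm(A * x - b)^2
elementwise = sum((dot(A[i, s], x[s]) - b[i])^2 for (i, s) in enumerate(subsets))
```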

ExpressionTreeForge (ETF) is able to detect the partial separability of a function as long as its operators are supported by ETF. Practically, it retrieves the element functions $f_i$ and the selectors $U_i \in \mathbb{R}^{n_i \times n}$, which pick out the subset of variables for each $f_i$. I usually write $\hat{f}_i : \mathbb{R}^{n_i} \to \mathbb{R}$ such that $\hat{f}_i(U_i x) = f_i(x)$. From there, ETF considers several possibilities, such as:

  • apply automatic differentiation on the expression tree of $\hat{f}_i$;
  • build a JuMP model for each $\hat{f}_i$, and use the JuMP AD;
  • ...
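A tiny sketch of the selector identity above (hypothetical element function; $U_i$ written as a dense 0/1 matrix for clarity):

```julia
# The defining identity f̂_i(U_i x) = f_i(x): U_i selects the n_i variables
# that element function i actually uses out of the full x in R^n.
U2 = [0 1 0 0;
      0 0 1 0]                # picks x[2] and x[3] (n_i = 2, n = 4)
f̂2(y) = y[1] * y[2]           # element function on the reduced space R^2
f2(x) = x[2] * x[3]           # the same function viewed on the full space
x = [1.0, 2.0, 3.0, 4.0]
f̂2(U2 * x) == f2(x)           # true: both evaluate to 6.0
```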

However, none of my attempts were able to compete with the gradient computation from JuMP. I thought a lot about how to compute $\nabla \hat{f}_i$ during the computation of $\nabla f$, but it would require modifying the ReverseDiff tape, which is, unfortunately, out of my reach for now.

A final note: as it stands, ETF is not a particularly fast module. Nonetheless, it is built on a tree interface that makes it "easy" to implement new expression tree data structures without many changes to the algorithms that detect partial separability (which is its original purpose).

I hope this helps !

paraynaud avatar Feb 14 '24 20:02 paraynaud

Currently, MathOptSymbolicAD exploits problems with repeated structure in the Jacobian. So if you have a bunch of constraints `log(a[i] + x[i]) <= 0`, then we create one symbolic derivative `J(ai, xi) = 1 / (ai + xi)` and map that function across the static data `a` and `x`.

But at the moment we don't recurse into a partially separable function. So if the objective is `sum(log(a[i] + x[i]) for i in 1:N)`, we currently try to build a symbolic gradient with N terms. It would be much cheaper if we could identify the repeated structure and exploit it.

The experimental MadDiff does this well IIRC

Yes. https://github.com/exanauts/ExaModels.jl does exactly what we do here, except they (ab)use the Julia type system instead of Symbolics.jl :) Their input format also forces the user to provide the repeated structure, instead of detecting it automatically like we do here.

odow avatar Feb 15 '24 02:02 odow

If I understand correctly, you use Symbolics.jl to compute the symbolic derivative `J(ai, xi)`. In the case of `log(a[i] + x[i]^2)`, is `J = [1/(ai + xi^2); 2 xi/(ai + xi^2)]`?
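For what it's worth, that candidate `J` can be sanity-checked against central finite differences (toy values; my own check, not ETF or MathOptSymbolicAD code):

```julia
# Check J(a, x) = [∂f/∂a, ∂f/∂x] for f(a, x) = log(a + x^2) numerically.
f(a, x) = log(a + x^2)
J(a, x) = [1 / (a + x^2), 2x / (a + x^2)]

a0, x0, h = 2.0, 3.0, 1e-6
fd = [(f(a0 + h, x0) - f(a0 - h, x0)) / 2h,   # central difference in a
      (f(a0, x0 + h) - f(a0, x0 - h)) / 2h]   # central difference in x
err = maximum(abs.(fd .- J(a0, x0)))          # ≈ 0 up to truncation error
```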

ETF uses Symbolics only to retrieve an `Expr` from a Julia function, onto which several tree-manipulation routines can be applied. If you want to detect partial separability in the objective, you must use several of these routines. In my case, the detection of partial separability is done by PartiallySeparableSolvers.jl in the method `partitioned_structure`, using several routines from ETF. At some point it returns the vector containing the expression tree of each distinct element function ($M \leq N$ of them) and the indices $p_i$, $1 \leq i \leq N$, such that $1 \leq p_i \leq M$. For `sum(log(a[i] + x[i]) for i in 1:N)`, $M = 1$ and `p = ones(N)`. The returned expression tree can be an `Expr`, my handmade expression tree, or a JuMP model. Does that fit your requirements?
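A hand-written stand-in for that output shape (an illustrative sketch, not the actual return type of `partitioned_structure`):

```julia
# M distinct element expression trees plus an index map p: term i of the sum
# uses template p[i]. For sum(log(a[i] + x[i]) for i in 1:N), M = 1.
templates = [(a, x) -> log(a + x)]   # M = 1 distinct element function
N = 5
p = ones(Int, N)                     # every term maps to template 1
a = collect(1.0:5.0)
x = fill(0.5, N)
total = sum(templates[p[i]](a[i], x[i]) for i in 1:N)
```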

paraynaud avatar Feb 15 '24 17:02 paraynaud

Perhaps to clarify, I don't intend to use ETF in this package. It was just a link I was chatting to @amontoison about that seemed relevant and adjacent. It has some useful ideas :smile:

odow avatar Feb 15 '24 18:02 odow

Perhaps to clarify

I am glad you did, I misunderstood :') Feel free to contact me if you need to use ETF in the future !

paraynaud avatar Feb 15 '24 20:02 paraynaud