[WIP] Adds boost root finders with reverse mode specializations

Open SteveBronder opened this issue 3 years ago • 19 comments

Summary

Adds a wrapper for Boost's Halley root-finding algorithm, along with a reverse mode specialization similar to our algebra solver. Note that these functions are only meant to be used at the Stan Math level, specifically for the quantile functions in https://github.com/stan-dev/design-docs/pull/42

This also works for forward mode, but there it just iterates through the Boost functions, so it is probably very slow.

Tests

Tests cover forward and reverse mode using the cube root and fifth root examples from Boost's docs. Tests can be run with

./runTests.py -j4 ./test/unit/math/ -f root_finder

Side Effects

Note that I had to write specializations of frexp and sign for fvar and var types because Boost would throw compiler errors if they did not exist.

The WIP here is because of some questions for @bgoodri:

  1. Would we ever want to use the Newton-Raphson or Schröder root finders? I wrote this with those in mind, and it would be very simple to support them.
  2. Right now the API takes a tuple of functors that calculate the value and the derivatives. But as you can see in the fifth root example, this is somewhat inefficient: we are just computing powers of x, and we could share that computation between the calculations (see the sketch after this list). For the reverse mode specialization we do need a functor that calculates only the function whose root we want, but we could also let the user pass two functors, one returning a tuple with everything the root finder needs and another returning just the value of the function we want to solve.
  3. Do we need functions that work over vectors elementwise, or should we add those later?
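
For reference on the shared-computation point in (2), here is a minimal sketch calling Boost's halley_iterate directly on the fifth root example; the guess and bounds are just assumptions that work for a >= 1:

#include <boost/math/tools/roots.hpp>
#include <cstdint>
#include <limits>
#include <tuple>

// Solve x^5 - a = 0 with Halley iteration, sharing the x^3 term across
// f, f', and f'' instead of recomputing pow(x, n) three times.
double fifth_root(double a) {  // assumes a >= 1 so the root lies in [0, a]
  auto f_tuple = [a](double x) {
    const double x3 = x * x * x;
    return std::make_tuple(x3 * x * x - a,  // f(x)   = x^5 - a
                           5 * x3 * x,      // f'(x)  = 5 x^4
                           20 * x3);        // f''(x) = 20 x^3
  };
  // 0.4 * D binary digits for Halley, per the Boost docs' recommendation
  const int digits = static_cast<int>(0.4 * std::numeric_limits<double>::digits);
  std::uintmax_t max_iter = 20;
  return boost::math::tools::halley_iterate(
      f_tuple, /* guess */ a / 5, /* min */ 0.0, /* max */ a, digits, max_iter);
}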

Release notes

Adds Boost root finders with reverse mode specializations.

Checklist

  • [ ] Math issue #

  • [x] Copyright holder: Steve Bronder

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

    • Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    • Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • [x] the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass (make test-headers)
    • dependency checks pass (make test-math-dependencies)
    • docs build (make doxygen)
    • code passes the built-in C++ standards checks (make cpplint)
  • [x] the code is written in idiomatic C++ and changes are documented in the doxygen

  • [x] the new changes are tested

SteveBronder avatar Apr 29 '22 23:04 SteveBronder

(1) I imagine it will end up like the ODE solvers where we have slightly different versions with a mostly common set of arguments. But Halley is probably the one that would get used the most inside Stan Math.

bgoodri avatar Apr 29 '22 23:04 bgoodri

(2) It seems like it would be good to permit (or require) one functor that returns a tuple of the value, the first derivative, and the second derivative.

bgoodri avatar Apr 29 '22 23:04 bgoodri

(3) If a root needs to be found for multiple parameter values, I would say that the caller is responsible for invoking root_finder multiple times.

bgoodri avatar Apr 29 '22 23:04 bgoodri

I think we should default the digits argument according to the recommendation in the Boost docs.

The value of digits is crucial to good performance of these functions, if it is set too high then at best you will get one extra (unnecessary) iteration, and at worst the last few steps will proceed by bisection. Remember that the returned value can never be more accurate than f(x) can be evaluated, and that if f(x) suffers from cancellation errors as it tends to zero then the computed steps will be effectively random. The value of digits should be set so that iteration terminates before this point: remember that for second and third order methods the number of correct digits in the result is increasing quite substantially with each iteration, digits should be set by experiment so that the final iteration just takes the next value into the zone where f(x) becomes inaccurate. A good starting point for digits would be 0.6D for Newton and 0.4D for Halley or Schröder iteration, where D is std::numeric_limits<T>::digits.
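
Concretely, a default following that recommendation might look like this (a sketch; note these are binary digits, with D = 53 for double):

#include <limits>

// Boost-recommended starting points for the digits argument:
// 0.6 * D for Newton, 0.4 * D for Halley / Schröder iteration.
constexpr int D = std::numeric_limits<double>::digits;    // 53 binary digits
constexpr int newton_digits = static_cast<int>(0.6 * D);  // 31
constexpr int halley_digits = static_cast<int>(0.4 * D);  // 21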

bgoodri avatar Apr 29 '22 23:04 bgoodri

You could also easily test a * x^2 + b * x + c = 0, where a, b, and/or c are var, fvar<var>, etc., to make sure our implicit dx/da, dx/db, and dx/dc match those obtained by differentiating x = (-b + sqrt(b^2 - 4 * a * c)) / (2 * a) explicitly.
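
For reference, implicit differentiation gives the quantities that test would compare against: differentiating $$f(x; a, b, c) = a x^2 + b x + c = 0$$ gives

$$\frac{dx}{da} = -\frac{x^2}{2 a x + b}, \qquad \frac{dx}{db} = -\frac{x}{2 a x + b}, \qquad \frac{dx}{dc} = -\frac{1}{2 a x + b},$$

which should match the explicit derivatives of $$x = \frac{-b + \sqrt{b^2 - 4 a c}}{2 a}.$$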

bgoodri avatar Apr 29 '22 23:04 bgoodri

From your email: guess, min, and max should definitely be templated, but we just need to pull out the underlying numbers, since we wouldn't need to differentiate with respect to them.
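
In code, that would be something like the following (value_of_rec is Stan Math's existing helper for recursively stripping autodiff types down to double; the variable names are just for illustration):

// guess, min, and max can be var, fvar<var>, etc., but the root finder
// only needs their underlying double values.
double guess_val = stan::math::value_of_rec(guess);
double min_val = stan::math::value_of_rec(min);
double max_val = stan::math::value_of_rec(max);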

bgoodri avatar Apr 29 '22 23:04 bgoodri

Also, @charlesm93 might have some thoughts. As I mentioned yesterday, the motivation for having a one-dimensional specialization of algebraic_solver in Stan Math (we are not worried about exposing it to the Stan language yet) is

  1. Everything is in terms of scalars, so there is no Eigen overhead from making vectors of size 1, 1x1 Jacobians, etc.
  2. We can use bounds to ensure the correct root is found
  3. We can utilize second derivatives to obtain a tripling of the number of correct digits when close to the root

I guess the main question is: what is the fastest way of implicitly differentiating the root with respect to unknown parameters? A concrete example would be implementing the inverse CDF of the beta distribution in C++ with possibly unknown shape parameters. Given a cumulative probability p, we would need to find the value of x between 0 and 1 such that beta_cdf(x, alpha, beta) - p = 0. We know the first derivative wrt x is the beta PDF and the second derivative is the first derivative of the beta PDF, and we can presume within Stan Math that the caller has specified those two derivatives explicitly. Once x is found, its sensitivity to p is the reciprocal of the beta PDF, but the sensitivity of x to alpha and beta requires implicit differentiation.
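
Spelling out that last step for reference: with $$F(x; \alpha, \beta)$$ the beta CDF and $$f(x; \alpha, \beta)$$ the beta PDF, the root solves $$F(x; \alpha, \beta) - p = 0,$$ so implicit differentiation gives

$$\frac{\partial x}{\partial p} = \frac{1}{f(x; \alpha, \beta)}, \qquad \frac{\partial x}{\partial \alpha} = -\frac{\partial F / \partial \alpha}{f(x; \alpha, \beta)}, \qquad \frac{\partial x}{\partial \beta} = -\frac{\partial F / \partial \beta}{f(x; \alpha, \beta)}.$$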

bgoodri avatar Apr 29 '22 23:04 bgoodri

We should probably also do the beta inverse CDF as a test case. Boost has a more advanced way of obtaining an initial guess in this case, but they still usually do Halley iteration:

var p, alpha, beta; // give these values
using namespace boost::math;
beta_distribution<var> my_beta(alpha, beta);
auto x = quantile(my_beta, p);
// check our version against this

bgoodri avatar Apr 30 '22 00:04 bgoodri

So I got the tests for beta up and running. For the beta lcdf it seems to work fine for values where lgamma(alpha + beta) > 1. Maybe I wrote the derivative of the pdf wrt x wrong? Or there could be a numerics trick I'm missing.

For the test against Boost's quantile function, var does not work with the Boost distributions because of some functions they define that we do not have defined for vars. But using finite diff I get answers that are within 4e-4. Perhaps I'm not doing something quite right there as well? Or it could just be finite diff's own error. There are also some places where finite diff gives -nan values for alpha and beta but autodiff can return the correct values.

# from running 
# ./runTests.py -j2 test/unit/math/rev/functor/root_finder_test.cpp
--- Finite Diff----
fx: 0.288633
grads: 
p: 0.6987
alpha: 0.245729
beta: -0.121384
--- Auto Diff----
fx: 0.288714
grads: 
p: 0.700012
alpha: 0.248911
beta: -0.12257

grad diffs: 
p: -0.00131203
alpha: -0.00318197
beta: 0.00118579

SteveBronder avatar May 03 '22 20:05 SteveBronder

Also do we want to test lcdf or cdf functions?

SteveBronder avatar May 03 '22 20:05 SteveBronder

There are analytical derivatives for the beta quantile case at

https://functions.wolfram.com/GammaBetaErf/InverseBetaRegularized/20/01/

bgoodri avatar May 10 '22 16:05 bgoodri

@bgoodri so I got the rev test up and running with the explicit gradients for the beta cdf. I also wrote a lil' function that just checks that the gradients and our function are near'ish to each other. Though the test is failing for some values where I wouldn't expect it to. The failing ones all print their values after a failure so you can take a look if you like (though note that after one failure it will just keep printing all the sets of values it is testing; I have not figured out how to stop that from happening in gtest). I wonder if it could be some numerical inaccuracy in our beta_cdf function?

SteveBronder avatar May 12 '22 01:05 SteveBronder

(2): If the function is going to be used internally, we should try to optimize as much as possible. I see two options:

  • having one functor argument which returns the value and one functor argument which returns a tuple with the value and the derivatives.
  • passing a single functor which admits an argument indicating the order of the derivatives we need to compute. Internally, only do the calculations for the specified order, and return a tuple of value and derivatives (with the latter potentially just containing nans).

If this is exposed to the user, we may want to trade some efficiency for a simpler API.
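
For what it's worth, a rough sketch of the first option's shape (the name root_finder and the argument order are hypothetical):

// Hypothetical signature: f returns only the value (what the reverse-mode
// specialization needs), while f_grads returns a tuple {f(x), f'(x), f''(x)}
// that drives the Halley iterations.
template <typename F, typename FGrads, typename T>
auto root_finder(F&& f, FGrads&& f_grads, T guess, T min, T max, int digits);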

charlesm93 avatar May 18 '22 12:05 charlesm93

Why is there a specialized rev method? The implicit function is valid for both forward and reverse mode autodiff. Since the solution is a scalar, the full Jacobian of derivatives reduces to a vector, so we don't gain much from doing contractions with either an initial tangent or a cotangent.

I think it would make sense to explicitly compute the full Jacobian and then do the vector-vector product for either forward or reverse mode.
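
Concretely, for a scalar root $$x(\theta)$$ with parameters $$\theta \in \mathbb{R}^n,$$ the full Jacobian is just the gradient row vector $$\partial x / \partial \theta,$$ and both modes reduce to an inner product with it:

$$\dot{x} = \frac{\partial x}{\partial \theta}\,\dot{\theta} \ \ \text{(forward)}, \qquad \bar{\theta} = \bar{x}\,\left(\frac{\partial x}{\partial \theta}\right)^{\top} \ \ \text{(reverse)}.$$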

charlesm93 avatar May 18 '22 13:05 charlesm93

Thanks @charlesm93

bgoodri avatar May 18 '22 14:05 bgoodri

Thanks @charlesm93! I'm going to rework the API rn. I'd like to make it a bit more efficient for reverse mode, which would mean supplying two separate functors: one that calculates the function and its derivatives all at the same time for efficiency, and another that calculates just the function.

SteveBronder avatar May 23 '22 16:05 SteveBronder

@bgoodri alrighty, so I got the reverse mode adjoints right up to 1e-9. It turns out that we need more precision when doing the forward pass during reverse mode, which I found kind of weird and interesting. For the beta cdf example it seems like we need about 23 (???) digits of precision, which is very odd to me. We only work in doubles, so I'm not sure why having anything over 16 would be a thing. But tests now pass for alpha, beta, and p from .1 to .9.

SteveBronder avatar Jun 03 '22 21:06 SteveBronder

It turns out that we need more precision when doing the forward pass during reverse mode, which I found kind of weird and interesting

The forward-mode values are often used as part of the derivative calculations, so I suspect that may be the problem.

For the beta cdf example it seems like we need about 23 (???) digits of precision, which is very odd to me

As you point out, we only have about 16 digits of relative precision in double precision, so we can't return results with that high precision. CPUs often do higher-precision internal arithmetic and then truncate the result, but there's simply no way to get 23 digits of accuracy in a floating-point result at double precision. We can get to within 1e-23 of an answer in absolute precision, though.

bob-carpenter avatar Jun 09 '22 17:06 bob-carpenter

May I suggest that we adopt existing (closed-form) QF/CDF pairs for tests? Here are suggested QFs and CDFs (where missing in Stan).

  • Unbounded:
    • logistic distribution $$Q(u)=\mu+s[\ln(u)-\ln(1-u)]$$
  • Semi-bounded:
    • Weibull $$Q(u)=\lambda[-\ln(1-u)]^{1/k}$$
    • Gumbel distributions $$Q(u)=\mu-\beta\ln(-\ln(u))$$
    • Exponential (simplest QF, but perhaps most difficult to invert) $$Q(u)=-(1/\lambda)\ln(1-u)$$
  • Bounded:
    • Kumaraswamy distribution $$Q(u)=(1-(1-u)^{1/b})^{1/a}$$ $$F(x)=1-(1-x^a)^b$$

All of these distributions have closed-form CDFs and QFs, so we always have an analytical solution to check against.
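
For instance, a minimal sketch of the kind of check the Kumaraswamy pair enables (the function names here are hypothetical, not existing Stan Math functions):

#include <cmath>

// Kumaraswamy CDF F(x) = 1 - (1 - x^a)^b and its closed-form quantile Q(u);
// a root finder solving F(x) - u = 0 can be checked against Q(u) exactly.
double kumaraswamy_cdf(double x, double a, double b) {
  return 1 - std::pow(1 - std::pow(x, a), b);
}
double kumaraswamy_qf(double u, double a, double b) {
  return std::pow(1 - std::pow(1 - u, 1 / b), 1 / a);
}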

dmi3kno avatar Jan 04 '23 16:01 dmi3kno