Crash running dopri5
Hello! I'm having issues running the dopri5 method on a time sequence, in a setup very similar to the latent ODE example. I am trying to run odeint (via odeint_adjoint) with a z0 batch of shape (batchsize, time_sequence, 1). Whenever I run the following code, I get the error below:
print(times)
print(z0)
pred_z = odeint(func, z0, times, method="dopri5").permute(1, 0, 2) # odeint_adjoint
tensor([ 0., 60., 120., 180.], device='cuda:0')
tensor([[ 3.4961e-01, -1.3341e+00, -1.0196e+00, -5.6237e-01, -5.7106e-01,
-2.0538e-01, -2.0877e-01, 1.6032e-01, -1.0014e-01, 8.7283e-01,
2.0110e+00, -6.1008e-01, 2.5714e-01, 2.3386e+00, 2.7314e+00,
2.4449e+00],
[ 2.3964e-01, -3.9234e+00, -5.1578e-01, -1.1946e+00, -1.5457e+00,
-7.2809e-01, 8.4074e-01, 2.4824e+00, 3.0488e-01, 3.8429e-01,
3.0424e-01, -3.8435e-01, 1.7489e+00, 1.0455e+00, -2.3369e-01,
-1.3098e+00],
[-3.7673e-01, -6.8062e-01, -7.5301e-02, -4.2621e-01, 1.0845e+00,
-4.3786e-01, 4.9334e-01, 1.7223e+00, 9.4618e-01, 9.6530e-01,
2.8994e+00, 1.0563e+00, 8.6989e-01, 1.9997e+00, -1.1819e+00,
2.2736e-01],
[ 6.3911e-01, -2.0203e-01, 1.4277e+00, -2.0914e-01, -1.9965e+00,
7.3284e-02, -5.9003e-01, -9.9907e-01, 1.2299e-01, -8.0105e-02,
1.2543e+00, 8.7276e-01, 2.9519e-01, 6.1938e-01, 1.2256e+00,
-2.1609e-01],
[ 7.0405e-01, 9.3754e-02, -2.9164e-02, -2.3351e-02, -2.7254e-01,
-6.6201e-01, -1.2737e+00, 6.7255e-01, -2.8363e-01, -6.3016e-01,
2.9853e+00, 1.7805e+00, 2.4158e-01, 1.1367e+00, 1.4954e+00,
1.8174e-01],
[ 8.2383e-01, -8.8112e-01, -1.2737e+00, 6.5401e-02, -4.7465e-01,
4.9482e-01, 7.0683e-01, 5.8325e-01, -9.1313e-01, -9.0717e-01,
1.9697e+00, -8.7827e-01, -3.4570e-01, 2.8642e-01, -2.0495e+00,
2.0563e+00],
[ 9.1569e-01, -1.2349e+00, -1.5394e+00, -7.0736e-01, -5.5272e-01,
-1.5898e+00, -6.0082e-01, 1.3908e+00, 2.7930e-01, 6.9085e-01,
5.1686e-01, 6.5842e-01, 6.8905e-01, 9.7911e-01, 5.6687e-01,
-6.5162e-01],
[ 1.5812e-01, -1.2457e+00, 1.9778e+00, 5.4664e-02, 8.2410e-01,
-2.6325e+00, 7.3439e-01, -1.0063e-02, 1.9677e-01, 2.4709e-01,
1.4928e+00, -2.9880e-01, 2.1503e+00, 1.8539e+00, 8.1897e-02,
4.5690e-01],
[ 4.1733e-01, -1.3703e-01, 3.1289e-01, -1.1011e+00, -1.3120e+00,
-2.0392e+00, -1.1899e+00, 6.1899e-01, 1.2533e+00, -1.9775e+00,
8.3711e-01, 4.2185e-01, 5.4436e-01, 1.0219e+00, -9.9984e-01,
-1.1049e+00],
[ 2.3511e-01, -2.3007e+00, -7.5956e-01, -7.1586e-01, -4.7162e-01,
1.9671e-01, 6.2589e-02, 9.0480e-01, -6.4017e-01, -1.6957e-01,
7.3696e-01, -9.2881e-01, -5.9447e-01, 3.0396e-01, 6.7777e-03,
-4.9320e-01],
[ 4.3563e-01, -2.3411e-01, 1.1254e+00, 1.9592e-01, -1.8889e+00,
-2.6352e+00, 7.4621e-01, 1.7433e+00, -5.3961e-01, 1.2617e+00,
1.7898e+00, 2.7057e-01, 1.8180e+00, 1.7901e+00, -8.2793e-01,
1.5555e+00],
[ 9.7916e-02, -2.3332e+00, -2.9267e+00, -5.0806e-01, -7.4166e-01,
4.6953e-01, 7.0740e-01, 9.7310e-01, 2.7399e-01, 6.2945e-01,
1.6237e+00, 1.4580e+00, -2.0365e-01, 1.5745e+00, -1.2565e+00,
1.0269e-01],
[ 1.8389e-01, -3.0685e-01, 7.5610e-01, -1.2899e+00, 9.0083e-01,
-8.9824e-01, 1.1224e+00, 9.1232e-01, 4.6421e-01, -2.4713e-01,
1.5397e+00, 7.5857e-01, 1.2211e-01, 1.3789e+00, -1.3621e+00,
1.6946e+00],
[ 8.4649e-01, -3.3510e-01, -3.8854e-03, 4.8532e-01, -1.7783e+00,
-1.0863e+00, -6.1646e-02, 1.9066e+00, -2.7216e-01, 8.5870e-01,
1.2407e+00, -7.2768e-02, 1.1854e+00, 2.8140e+00, 4.0069e-01,
7.5213e-01],
[ 2.7837e-01, -1.2072e+00, 9.8222e-01, -1.0471e-01, -1.0825e+00,
-9.6411e-04, 1.4368e+00, 1.3269e+00, -8.2408e-01, -2.9107e+00,
2.6670e+00, 1.2564e-01, 3.1355e+00, 1.3564e+00, -1.1322e+00,
1.7455e+00],
[ 5.5862e-01, 6.3158e-02, -1.1971e+00, -6.8839e-01, 3.9886e-01,
-9.8823e-01, 1.4100e+00, 3.3176e-01, -1.6303e+00, 1.1427e+00,
4.9966e-01, 6.6720e-01, 1.2786e+00, 1.3973e+00, -5.7325e-01,
-1.5150e+00]], device='cuda:0', grad_fn=<AddBackward0>)
Traceback (most recent call last):
  File "main.py", line 13, in <module>
    train_model("model", 10, gen)
  File "D:\github\Crispy\beta-predict\ode_torch.py", line 171, in train_model
    loss.backward()
  File "C:\Python36\lib\site-packages\torch\tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Python36\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "C:\Python36\lib\site-packages\torch\autograd\function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\adjoint.py", line 83, in backward
    torch.tensor([t[i], t[i - 1]]), rtol=rtol, atol=atol, method=method, options=options
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\odeint.py", line 72, in odeint
    solution = solver.integrate(t)
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\solvers.py", line 31, in integrate
    y = self.advance(t[i])
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\dopri5.py", line 90, in advance
    self.rk_state = self._adaptive_dopri5_step(self.rk_state)
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\dopri5.py", line 100, in _adaptive_dopri5_step
    assert t0 + dt > t0, 'underflow in dt {}'.format(dt.item())
AssertionError: underflow in dt 0.0
Also, if I try with adams it just hangs.
Any idea what is going on?
Thanks in advance!
This underflow error indicates that the system is too stiff for an explicit method to solve: the step size would need to be extremely small to satisfy the desired tolerance. If dopri5 can't solve it, adams will have the same problem. If you can get away with it, I'd suggest first increasing the tolerance parameters atol and rtol. This introduces more numerical error but makes the solves easier and faster.
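For example, something like the following, a minimal sketch assuming your func, z0, and times from the snippet above (the defaults are rtol=1e-7 and atol=1e-9):

from torchdiffeq import odeint_adjoint as odeint

# Looser tolerances: larger values mean coarser but easier/faster steps.
pred_z = odeint(
    func, z0, times,
    rtol=1e-3,   # relative tolerance (default 1e-7)
    atol=1e-4,   # absolute tolerance (default 1e-9)
    method="dopri5",
).permute(1, 0, 2)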
Stiffness is still a difficult problem to get around. Try architectures with smoother (analytic) activation functions, e.g. softplus or Swish. We also used weight decay to reduce the complexity of the system.
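As an illustration (the layer sizes and hyperparameters here are placeholders, not taken from your model), a softplus-based ODE function trained with weight decay might look like:

import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    # Placeholder dimensions; adjust to your own latent size.
    def __init__(self, dim=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Softplus(),            # smooth activation instead of e.g. ReLU
            nn.Linear(hidden, dim),
        )

    def forward(self, t, z):
        return self.net(z)

func = ODEFunc()
# Weight decay regularizes the dynamics, which tends to reduce stiffness.
optimizer = torch.optim.Adam(func.parameters(), lr=1e-3, weight_decay=1e-4)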
Thank you for your answer. Is there a way to get more information on those internal parameters for fine-tuning?
Hmm you're right. It's not clear from the README.md or docstring.
The basic idea is that each adaptive solver produces an error estimate for the current step; if the error is greater than some tolerance, the step is redone with a smaller step size, and this repeats until the error is below the provided tolerance.
The tolerance is calculated as atol + rtol * norm of current state, where the norm being used is the infinity norm (https://github.com/rtqichen/torchdiffeq/blob/master/torchdiffeq/_impl/misc.py#L152).
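As a toy illustration of that accept/reject loop (a scalar Heun/Euler pair, not the actual dopri5 implementation, which uses a 5th-order pair and the infinity norm over the whole state):

def adaptive_solve(f, y0, t0, t1, rtol=1e-3, atol=1e-6):
    # Toy scalar adaptive stepper: 2nd-order step with a 1st-order error estimate.
    t, y, dt = t0, y0, (t1 - t0) / 100
    while t < t1:
        dt = min(dt, t1 - t)
        k1 = f(t, y)
        k2 = f(t + dt, y + dt * k1)
        y_high = y + dt * (k1 + k2) / 2           # Heun (2nd order)
        y_low = y + dt * k1                       # Euler (1st order)
        error = abs(y_high - y_low)
        tolerance = atol + rtol * abs(y)          # same form as torchdiffeq's test
        if error <= tolerance:
            t, y = t + dt, y_high                 # accept the step
            dt *= 1.5                             # and try a larger step next time
        else:
            dt *= 0.5                             # reject, retry with a smaller step
            assert t + dt > t, 'underflow in dt'  # the same failure mode you hit
    return y

# A stiff-ish example: fast decay forces the solver to take tiny steps.
print(adaptive_solve(lambda t, y: -50.0 * y, 1.0, 0.0, 1.0))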
Maybe we could specify a minimum step size as a parameter, so the solver tries its best but falls back to that value if needed?
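Something like this, purely hypothetical ("min_step" is not an existing torchdiffeq option, just a sketch of the proposal):

# Hypothetical -- "min_step" is not an existing torchdiffeq option,
# only an illustration of the proposal above.
pred_z = odeint(func, z0, times, method="dopri5",
                options={"min_step": 1e-6}).permute(1, 0, 2)

# Internally the solver would clamp instead of asserting, e.g.
#   dt = max(shrunk_dt, min_step)
# accepting extra numerical error rather than raising the underflow assertion.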