
Crash running dopri5

Open Moxinilian opened this issue 6 years ago • 6 comments

Hello! I'm having issues running the dopri5 method on a time sequence, in a setup very close to the latent ODE example. I am calling odeint (via odeint_adjoint) with a z0 batch of shape (batchsize, time_sequence, 1). Whenever I run the following code, I get the error below:

print(times)
print(z0)
pred_z = odeint(func, z0, times, method="dopri5").permute(1, 0, 2) # odeint_adjoint
tensor([  0.,  60., 120., 180.], device='cuda:0')
tensor([[ 3.4961e-01, -1.3341e+00, -1.0196e+00, -5.6237e-01, -5.7106e-01,
         -2.0538e-01, -2.0877e-01,  1.6032e-01, -1.0014e-01,  8.7283e-01,
          2.0110e+00, -6.1008e-01,  2.5714e-01,  2.3386e+00,  2.7314e+00,
          2.4449e+00],
        [ 2.3964e-01, -3.9234e+00, -5.1578e-01, -1.1946e+00, -1.5457e+00,
         -7.2809e-01,  8.4074e-01,  2.4824e+00,  3.0488e-01,  3.8429e-01,
          3.0424e-01, -3.8435e-01,  1.7489e+00,  1.0455e+00, -2.3369e-01,
         -1.3098e+00],
        [-3.7673e-01, -6.8062e-01, -7.5301e-02, -4.2621e-01,  1.0845e+00,
         -4.3786e-01,  4.9334e-01,  1.7223e+00,  9.4618e-01,  9.6530e-01,
          2.8994e+00,  1.0563e+00,  8.6989e-01,  1.9997e+00, -1.1819e+00,
          2.2736e-01],
        [ 6.3911e-01, -2.0203e-01,  1.4277e+00, -2.0914e-01, -1.9965e+00,
          7.3284e-02, -5.9003e-01, -9.9907e-01,  1.2299e-01, -8.0105e-02,
          1.2543e+00,  8.7276e-01,  2.9519e-01,  6.1938e-01,  1.2256e+00,
         -2.1609e-01],
        [ 7.0405e-01,  9.3754e-02, -2.9164e-02, -2.3351e-02, -2.7254e-01,
         -6.6201e-01, -1.2737e+00,  6.7255e-01, -2.8363e-01, -6.3016e-01,
          2.9853e+00,  1.7805e+00,  2.4158e-01,  1.1367e+00,  1.4954e+00,
          1.8174e-01],
        [ 8.2383e-01, -8.8112e-01, -1.2737e+00,  6.5401e-02, -4.7465e-01,
          4.9482e-01,  7.0683e-01,  5.8325e-01, -9.1313e-01, -9.0717e-01,
          1.9697e+00, -8.7827e-01, -3.4570e-01,  2.8642e-01, -2.0495e+00,
          2.0563e+00],
        [ 9.1569e-01, -1.2349e+00, -1.5394e+00, -7.0736e-01, -5.5272e-01,
         -1.5898e+00, -6.0082e-01,  1.3908e+00,  2.7930e-01,  6.9085e-01,
          5.1686e-01,  6.5842e-01,  6.8905e-01,  9.7911e-01,  5.6687e-01,
         -6.5162e-01],
        [ 1.5812e-01, -1.2457e+00,  1.9778e+00,  5.4664e-02,  8.2410e-01,
         -2.6325e+00,  7.3439e-01, -1.0063e-02,  1.9677e-01,  2.4709e-01,
          1.4928e+00, -2.9880e-01,  2.1503e+00,  1.8539e+00,  8.1897e-02,
          4.5690e-01],
        [ 4.1733e-01, -1.3703e-01,  3.1289e-01, -1.1011e+00, -1.3120e+00,
         -2.0392e+00, -1.1899e+00,  6.1899e-01,  1.2533e+00, -1.9775e+00,
          8.3711e-01,  4.2185e-01,  5.4436e-01,  1.0219e+00, -9.9984e-01,
         -1.1049e+00],
        [ 2.3511e-01, -2.3007e+00, -7.5956e-01, -7.1586e-01, -4.7162e-01,
          1.9671e-01,  6.2589e-02,  9.0480e-01, -6.4017e-01, -1.6957e-01,
          7.3696e-01, -9.2881e-01, -5.9447e-01,  3.0396e-01,  6.7777e-03,
         -4.9320e-01],
        [ 4.3563e-01, -2.3411e-01,  1.1254e+00,  1.9592e-01, -1.8889e+00,
         -2.6352e+00,  7.4621e-01,  1.7433e+00, -5.3961e-01,  1.2617e+00,
          1.7898e+00,  2.7057e-01,  1.8180e+00,  1.7901e+00, -8.2793e-01,
          1.5555e+00],
        [ 9.7916e-02, -2.3332e+00, -2.9267e+00, -5.0806e-01, -7.4166e-01,
          4.6953e-01,  7.0740e-01,  9.7310e-01,  2.7399e-01,  6.2945e-01,
          1.6237e+00,  1.4580e+00, -2.0365e-01,  1.5745e+00, -1.2565e+00,
          1.0269e-01],
        [ 1.8389e-01, -3.0685e-01,  7.5610e-01, -1.2899e+00,  9.0083e-01,
         -8.9824e-01,  1.1224e+00,  9.1232e-01,  4.6421e-01, -2.4713e-01,
          1.5397e+00,  7.5857e-01,  1.2211e-01,  1.3789e+00, -1.3621e+00,
          1.6946e+00],
        [ 8.4649e-01, -3.3510e-01, -3.8854e-03,  4.8532e-01, -1.7783e+00,
         -1.0863e+00, -6.1646e-02,  1.9066e+00, -2.7216e-01,  8.5870e-01,
          1.2407e+00, -7.2768e-02,  1.1854e+00,  2.8140e+00,  4.0069e-01,
          7.5213e-01],
        [ 2.7837e-01, -1.2072e+00,  9.8222e-01, -1.0471e-01, -1.0825e+00,
         -9.6411e-04,  1.4368e+00,  1.3269e+00, -8.2408e-01, -2.9107e+00,
          2.6670e+00,  1.2564e-01,  3.1355e+00,  1.3564e+00, -1.1322e+00,
          1.7455e+00],
        [ 5.5862e-01,  6.3158e-02, -1.1971e+00, -6.8839e-01,  3.9886e-01,
         -9.8823e-01,  1.4100e+00,  3.3176e-01, -1.6303e+00,  1.1427e+00,
          4.9966e-01,  6.6720e-01,  1.2786e+00,  1.3973e+00, -5.7325e-01,
         -1.5150e+00]], device='cuda:0', grad_fn=<AddBackward0>)
Traceback (most recent call last):
  File "main.py", line 13, in <module>
    train_model("model", 10, gen)
  File "D:\github\Crispy\beta-predict\ode_torch.py", line 171, in train_model
    loss.backward()
  File "C:\Python36\lib\site-packages\torch\tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Python36\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "C:\Python36\lib\site-packages\torch\autograd\function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\adjoint.py", line 83, in backward
    torch.tensor([t[i], t[i - 1]]), rtol=rtol, atol=atol, method=method, options=options
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\odeint.py", line 72, in odeint
    solution = solver.integrate(t)
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\solvers.py", line 31, in integrate
    y = self.advance(t[i])
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\dopri5.py", line 90, in advance
    self.rk_state = self._adaptive_dopri5_step(self.rk_state)
  File "d:\github\crispy\torchdiffeq\torchdiffeq\_impl\dopri5.py", line 100, in _adaptive_dopri5_step
    assert t0 + dt > t0, 'underflow in dt {}'.format(dt.item())
AssertionError: underflow in dt 0.0

Also, if I try with adams, it just hangs. Any idea what is going on? Thanks in advance!

Moxinilian avatar May 30 '19 15:05 Moxinilian

This underflow error indicates that the system is too stiff for an explicit method to solve: the step size would need to be extremely small to satisfy the desired tolerance. If dopri5 can't solve it, then adams will have the same problem. If you can get away with it, I'd suggest first increasing the tolerance parameters atol and rtol. This introduces more numerical error but makes the solve easier and faster.
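To illustrate why looser tolerances help, here is a toy adaptive stepper (a minimal sketch in plain Python, not torchdiffeq's dopri5; the ODE y' = -50y and the Euler/Heun pair are illustrative assumptions) showing that larger atol/rtol let the controller accept much larger steps on a stiff problem:

```python
import math

def integrate(atol, rtol, y0=1.0, t0=0.0, t1=1.0):
    """Adaptive Euler/Heun pair on the stiff ODE y' = -50*y."""
    t, y, dt, steps = t0, y0, 1e-3, 0
    while t < t1:
        dt = min(dt, t1 - t)
        f0 = -50.0 * y
        y_euler = y + dt * f0                 # 1st-order step
        f1 = -50.0 * y_euler
        y_heun = y + dt * 0.5 * (f0 + f1)     # 2nd-order step
        err = abs(y_heun - y_euler)           # local error estimate
        tol = atol + rtol * max(abs(y), abs(y_heun))
        if err <= tol:                        # accept the step
            t, y = t + dt, y_heun
            steps += 1
        # grow/shrink dt from the error ratio (order-1 pair -> sqrt)
        dt *= min(5.0, max(0.2, 0.9 * math.sqrt(tol / max(err, 1e-16))))
    return y, steps

y_tight, n_tight = integrate(atol=1e-9, rtol=1e-7)
y_loose, n_loose = integrate(atol=1e-4, rtol=1e-3)
print(n_tight, n_loose)  # the loose-tolerance solve takes far fewer steps
```

With very tight tolerances the controller is forced down to tiny steps (the same pressure that makes dopri5's dt underflow); loosening atol/rtol trades accuracy for far fewer steps.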

Stiffness is still a difficult problem to get around. Try architectures with nicer (analytic) activation functions, e.g. softplus or Swish. We also used weight decay to reduce the complexity of the system.
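The point about analytic activations can be seen numerically: softplus is smooth at 0, while ReLU's derivative jumps there, and such kinks in the learned dynamics can make the ODE harder for an adaptive solver. A small standalone check (illustrative, not from the library):

```python
import math

def relu(x):
    return max(x, 0.0)

def softplus(x):
    # numerically stable log(1 + exp(x))
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def slope(f, a, h=1e-4):
    return (f(a + h) - f(a)) / h

# one-sided slopes just left and right of the kink at 0
jump_relu = slope(relu, 1e-4) - slope(relu, -2e-4)          # ~1.0: a jump
jump_softplus = slope(softplus, 1e-4) - slope(softplus, -2e-4)  # ~0.0: smooth
print(jump_relu, jump_softplus)
```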

rtqichen avatar Jun 04 '19 01:06 rtqichen

Thank you for your answer. Is there a way to get information on those internal parameters for fine-tuning?

Moxinilian avatar Jun 04 '19 06:06 Moxinilian

Hmm you're right. It's not clear from the README.md or docstring.

The basic idea is that each adaptive solver produces an error estimate for the current step; if the error is greater than some tolerance, the step is redone with a smaller step size, and this repeats until the error falls below the provided tolerance.

The tolerance is calculated as atol + rtol * norm of current state, where the norm being used is the infinity norm (https://github.com/rtqichen/torchdiffeq/blob/master/torchdiffeq/_impl/misc.py#L152).
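As a sketch of the acceptance test just described (assumed semantics written out in plain Python, not the library's actual implementation): the per-component tolerance is atol + rtol * |state|, and the components are combined with an infinity norm, so the step is accepted when the worst-case ratio is at most 1.

```python
def error_ratio(error_estimate, y0, y1, atol, rtol):
    # per-component tolerance: atol + rtol * max(|y0_i|, |y1_i|)
    ratios = [
        abs(e) / (atol + rtol * max(abs(a), abs(b)))
        for e, a, b in zip(error_estimate, y0, y1)
    ]
    return max(ratios)  # infinity norm over components

y0, y1 = [1.0, 100.0], [1.1, 99.0]
small = error_ratio([1e-8, 5e-9], y0, y1, atol=1e-9, rtol=1e-7)
large = error_ratio([1e-6, 5e-7], y0, y1, atol=1e-9, rtol=1e-7)
print(small <= 1.0, large <= 1.0)  # first step accepted, second rejected
```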

rtqichen avatar Jun 04 '19 14:06 rtqichen

Maybe a minimum step size could be specified as a parameter, so the solver can try its best but fall back to that value if needed?

Moxinilian avatar Jun 04 '19 19:06 Moxinilian
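That proposal could be sketched as a clamp in the step-size controller (hypothetical names throughout; torchdiffeq does not expose a dt_min option here): instead of asserting on underflow, the adapted step size would bottom out at a user-supplied minimum.

```python
def next_step_size(dt, error_ratio, dt_min=1e-6, safety=0.9, order=5):
    """Standard step-size update, clamped at dt_min instead of underflowing."""
    if error_ratio > 0:
        # shrink/grow by the usual (error_ratio)^(-1/order) rule, bounded
        dt = dt * min(10.0, max(0.2, safety * error_ratio ** (-1.0 / order)))
    # fall back to dt_min rather than raising "underflow in dt"
    return max(dt, dt_min)

print(next_step_size(1e-12, error_ratio=1e6))  # clamped to 1e-06
```

The trade-off is that once dt is pinned at dt_min, the solver no longer guarantees the requested tolerance, so the result may be inaccurate rather than failing loudly.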