Autograd error: in-place operation modifies a variable needed for gradient computation.
I followed the instructions in README.md and ran the following:
python train.py --batch_size 1 --gpu_ids [0]
This yields the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 6, 256]], which is output 0 of SliceBackward, is at version 2; expected version 1 instead.
If I enable autograd anomaly detection in PyTorch with torch.autograd.set_detect_anomaly(True), I learn that the following in-place operation breaks the gradient computation:
https://github.com/chrischute/flowplusplus/blob/4f12a4624f6b7c372464f5ce0e7eb0dbb0864c89/util/array_util.py#L64
If I comment out that line, the autograd computation runs without error, but the loss computation (obviously) becomes wrong. I'm a bit confused, because there is another, similar in-place operation which does not seem to cause any issue with autograd (see the code below).
if reverse:
    y, z = (t.contiguous().view(b, c, h * w // 2) for t in x)
    x = torch.zeros(b, c, h * w, dtype=y.dtype, device=y.device)
    x[:, :, y_idx] += y  # in place but no problem
    x[:, :, z_idx] += z  # in place and yields the problem
    x = x.view(b, c, h, w)
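To convince myself of what autograd is actually checking, I wrote a tiny standalone repro of the version-counter error (my own sketch, unrelated to the repo's code). exp() saves its output for the backward pass, so modifying that output in place invalidates the saved tensor:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.exp()   # ExpBackward saves y itself to compute the gradient
y.add_(1)     # in-place op bumps y's version counter

try:
    y.sum().backward()
    raised = False
except RuntimeError:  # "... modified by an inplace operation ..."
    raised = True
```

With torch.autograd.set_detect_anomaly(True) enabled, the same repro additionally prints a traceback pointing at the forward-pass line that produced the saved tensor.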
I think there is a way to perform the needed computation without resorting to in-place operations. If I get it to work, I'll post the alternative.
You could try changing that line to x[:, :, z_idx] = x[:, :, z_idx] + z.
Still, I'm not sure why you'd run into this, since I've never seen it before.
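If the one-liner doesn't help, a fully out-of-place way to rebuild x might look like this (a sketch with toy sizes; the real b, c and index tensors come from the surrounding code in array_util.py):

```python
import torch

# Toy stand-ins for b, c, h*w and the checkerboard index tensors.
b, c, n = 1, 2, 6
y_idx = torch.tensor([0, 2, 4])
z_idx = torch.tensor([1, 3, 5])

y = torch.randn(b, c, n // 2, requires_grad=True)
z = torch.randn(b, c, n // 2, requires_grad=True)

# Out-of-place merge: concatenate the halves, then reorder the columns
# into checkerboard order via the inverse permutation.
perm = torch.cat([y_idx, z_idx])   # target position of each concatenated column
inv = torch.empty_like(perm)
inv[perm] = torch.arange(n)        # inverse permutation (index math, no grad)
x = torch.cat([y, z], dim=2).index_select(2, inv)

x.sum().backward()                 # no in-place error; grads reach y and z
```

Since nothing is ever written into an existing tensor, there is no version counter to invalidate.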
Sorry for not writing sooner; I had already tried what you suggested, without luck. Fortunately, the following fixed the issue in my case.
I read through the environment.yml file and realized it doesn't explicitly state which version of Python to install. Conda installs 3.7.3 by default on my setup. I suspected this might be the issue and tried Python 3.5.2, which worked.
I'm not sure why this makes a difference with respect to in-place operations in autograd. I briefly looked through the changelogs of the different Python versions for anything related to in-place evaluation, but unfortunately without any luck.
On a slightly related note: I remember implementing RealNVP in TensorFlow some time ago. I tried to use an assignment-based checkerboard implementation like the one here, but someone strongly advised against assignments in GPU code, claiming they would be very slow. When running this code I get 10-20% GPU utilization as measured by nvtop. Do you get similar results? I'll look into this further and let you know if I manage to increase GPU utilization or find a mistake on my end.
A big thanks for open sourcing your implementation, it is a huge help for my current research project.
Yes, GPU utilization is low with this checkerboard strategy, but I chose it to save memory: the assignment strategy uses less memory than multiplying with a mask, and Flow++ is memory-intensive, especially with the self-attention layers enabled.
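For reference, the mask-multiply variant looks roughly like this (a sketch with toy sizes, not the code from this repo); it trades a full-size mask and two full-size intermediates for better GPU throughput:

```python
import torch

b, c, h, w = 1, 2, 4, 4                # toy sizes, not the repo's defaults

# Checkerboard mask over the spatial dims.
mask = torch.zeros(h, w)
mask[0::2, 0::2] = 1.0
mask[1::2, 1::2] = 1.0
mask = mask.view(1, 1, h, w)

x = torch.randn(b, c, h, w, requires_grad=True)

# Mask-multiply split: out-of-place, so autograd is happy, but it
# materializes the full-size mask and two full-size intermediates.
x_even = x * mask
x_odd = x * (1 - mask)
```

The two halves always recombine exactly (x_even + x_odd == x), since each element is multiplied by exactly 0 or 1.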
If you want to try the version that will give you higher GPU utilization, you can find example code in this Real NVP repo here.