Stuck at validation step during 2x_HAT finetuning.

Open • fatbardhfeta opened this issue 2 years ago • 1 comment

I am fine-tuning the 2x HAT model on my own dataset, but training freezes during the validation step and I get this warning:

UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [180, 6, 1, 1], strides() = [6, 1, 6, 6]
bucket_view.sizes() = [180, 6, 1, 1], strides() = [6, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/home/fatbardhf/.virtualenvs/hat-env/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [180, 6, 1, 1], strides() = [6, 1, 6, 6]
bucket_view.sizes() = [180, 6, 1, 1], strides() = [6, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)

Jun 25 '23 13:06 fatbardhfeta

When I changed the PyTorch version from 2.1.0 to 1.7.1, that message disappeared.

From:
    conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
To:
    conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
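As a side note on the warning quoted in the issue: the two stride sets it reports differ only in the dimensions of size 1, where the index is always 0, so the stride value there never contributes to an element's address. The sketch below checks this in plain Python, without PyTorch. The helper name `offsets` is hypothetical and not part of any library; the sizes and strides are copied from the warning. It only illustrates why the warning describes itself as "not an error", and it does not explain the freeze during validation.

```python
from itertools import product


def offsets(sizes, strides):
    """Return the memory offset of every element, in row-major index order.

    Hypothetical helper for illustration only: offset = sum(index[k] * strides[k]).
    """
    return [
        sum(i * s for i, s in zip(index, strides))
        for index in product(*(range(n) for n in sizes))
    ]


# Values copied from the warning message.
sizes = [180, 6, 1, 1]
grad_strides = [6, 1, 6, 6]
bucket_strides = [6, 1, 1, 1]

# The last two dimensions have size 1, so their index is always 0 and
# their stride never affects the computed offset.
same_layout = offsets(sizes, grad_strides) == offsets(sizes, bucket_strides)
print(same_layout)  # True
```

Under this reading, both layouts address the same 1080 elements in the same order, so the mismatch is about stride bookkeeping rather than a different memory layout.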

Nov 10 '23 00:11 great-energizer