Pearson Zhao
@layumi ```` bash Epoch: 00 Iteration: 00000094/00100000 DLoss: 5.9224 Reg: 0.0000 Elapsed time in upate: 0.364377 Traceback (most recent call last): File "train.py", line 122, in trainer.dis_update(images_a, images_b, config) File...
@layumi Yes, there were some warnings, but I ignored them. Do you think they might be the cause? I will reinstall `apex`.
@layumi I reinstalled several times in different environments. I don't think the problem has much to do with the installation. During installation, warnings appeared about `command line option '-Wstrict-prototypes' is valid for C/ObjC...
@layumi I agree with `it looks like overflow`. ```` Epoch: 00 Iteration: 00000083/00100000 DLoss: nan Reg: nan Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39 ........
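A quick arithmetic check on the loss scale reported in that log, as a minimal sketch. It assumes the scaler started at 65536 (2**16) and halves the scale on every skipped step. Neither is stated in the log, although the `reducing loss scale to 32768.0` message from another run would be consistent with both.

```python
import math

# Loss scale copied from the "Gradient overflow" message in the log.
observed_scale = 2.938735877055719e-39
# Assumption: the dynamic loss scaler started at 2**16 and halves on overflow.
assumed_initial_scale = 65536.0

# Express both scales as powers of two.
observed_exp = round(math.log2(observed_scale))
initial_exp = round(math.log2(assumed_initial_scale))

# Number of consecutive overflow steps needed to get from one to the other.
halvings = initial_exp - observed_exp

print(f"observed scale is about 2**{observed_exp}")
print(f"consecutive halvings from 2**{initial_exp}: {halvings}")
```

Under these assumptions the reported scale is 2**-128, which would take 144 skipped steps in a row. A scale that small suggests the loss was already `nan` or `inf` before scaling, so reducing the scale further could not help.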
Hi @layumi, sorry for the late reply; I got busy and forgot about GitHub. ```` DLoss:6.0000 Reg:0.0000 Gradient overflow, skipping step, loss scaler 0 reducing loss scale to 32768.0 Gradient overflow, skipping step, loss...
@layumi I downloaded market1501 again, but the same error still appeared. Training with the following command works: `python train.py --config configs/latest.yaml` But the following command does not: `python train.py...
@layumi 1. I trained `Person_reID_baseline_pytorch` to completion, and `nan` did not appear. So I think `apex` and `cuda` are not the problem. ```` Epoch 59/59 train_loss:0.0205 Acc: 0.9979 val_loss: 0.2508 Acc: 0.9374...
@layumi I tried it, but the same error still appears.
@layumi Yes, I printed the tensor value,
@layumi `nan` appears. I printed the values as follows; the other values are close.
````
print(torch.mean((out0.float()-0)**2))
tensor(1.9329e-10,device='cuda:0')
print(torch.mean((out1.float()-1)**2))
tensor(0.9999,device='cuda:0')
````
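To put the two printed values in context, here is a small sketch of what they would become in half precision, using only the Python standard library. The helper `to_fp16` is hypothetical and written for illustration; the sketch only looks at the representable range and makes no claim about where the `nan` comes from.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Values copied from the printed tensors.
mse_out0 = 1.9329e-10
mse_out1 = 0.9999

# Smallest positive (subnormal) half-precision value.
smallest_fp16 = 2.0 ** -24

fp16_out0 = to_fp16(mse_out0)
fp16_out1 = to_fp16(mse_out1)

print("out0 term in fp16:", fp16_out0)
print("out1 term in fp16:", fp16_out1)
print("below fp16 range: ", mse_out0 < smallest_fp16)
```

The first value is far below the smallest positive half-precision number, so it is only visible because of the `.float()` cast before squaring. Without the cast, that term would round to zero in fp16.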