Transformer-SSL

Strange output log

Open launchauto opened this issue 4 years ago • 12 comments

Hi authors, I have pretrained your moby_swin_tiny model on 8 Tesla V100 GPUs and reproduced your results on the downstream tasks: 74.394% on linear evaluation, 43.1% on COCO object detection, and 39.3% on COCO segmentation. But the loss and grad_norm look really weird during training. Could you share your log? Here is mine. The loss drops to 7, then rises to 16 and never drops again. During pretraining, the average grad norm sometimes becomes infinite. log_rank0.txt

launchauto avatar May 28 '21 03:05 launchauto

The uploaded log_rank0.txt is the pretraining log from one of the eight GPUs, and the uploaded log_rank7.txt is the linear evaluation log from one of the eight GPUs. log_rank7.txt

launchauto avatar May 28 '21 03:05 launchauto
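
One mechanism that can make a logged grad norm infinite under mixed precision is dynamic loss scaling: gradients are multiplied by a large scale, and if any scaled value overflows the float16 range, the step is skipped and the scale is lowered. The following is a minimal pure-Python sketch of that idea, not apex's actual implementation. `to_fp16_range`, `scaled_step`, and the halving rule are hypothetical names and simplifications chosen for illustration, and nothing in this thread confirms that this is what happened in the attached logs.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def to_fp16_range(x):
    """Mimic float16 overflow: magnitudes above FP16_MAX become +/-inf."""
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x


def grad_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def scaled_step(true_grads, loss_scale):
    """Simulate one iteration with loss scaling (simplified, hypothetical rule).

    Returns (logged_norm, step_skipped, next_scale).
    """
    scaled = [to_fp16_range(g * loss_scale) for g in true_grads]
    if any(math.isinf(g) for g in scaled):
        # Overflow: the logged norm is inf, the step is skipped,
        # and the scale is halved for the next iteration.
        return math.inf, True, loss_scale / 2
    unscaled = [g / loss_scale for g in scaled]
    return grad_norm(unscaled), False, loss_scale


# The same gradients overflow at a large scale but not at a smaller one.
overflow_result = scaled_step([3.0, 4.0], 65536.0)
normal_result = scaled_step([3.0, 4.0], 8192.0)

# A running average that includes a single inf is itself inf.
running_average = sum([5.0, 5.0, math.inf]) / 3
```

If the log prints a running average of the grad norm, one overflowed iteration is enough to turn that average into inf. A later comment in this thread reports the same problem with O0 precision, so mixed precision may not be the whole story.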

I also encountered the same problem.

michuanhaohao avatar Jun 15 '21 02:06 michuanhaohao

@launchauto @michuanhaohao Me too, but I ran it with precision O0. Did you run with O0 precision? log_rank0.txt

tbup avatar Feb 08 '22 06:02 tbup

I ran into this problem too! The loss stays at 16 and never drops?

Rocky1salady-killer avatar Jun 23 '22 06:06 Rocky1salady-killer

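How can I avoid using apex mixed precision? When I train with Swin Transformer, the loss drops and converges. However, I noticed that the Swin Transformer project does not use apex mixed precision.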

Rocky1salady-killer avatar Jun 23 '22 06:06 Rocky1salady-killer

Is it normal for the loss value to be around 16? Has anyone encountered this problem?

Chengyang852 avatar Mar 21 '23 02:03 Chengyang852
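
For a contrastive loss of the InfoNCE kind, a value near 16 can be compared against the chance-level baseline: if the model cannot tell the positive key from K negatives, each cross-entropy term equals ln(K + 1). The sketch below computes that baseline, assuming two symmetric loss terms and a queue of 4096 negatives. Both numbers are assumptions about the MoBY defaults that are not stated in this thread, and `chance_level_loss` is a hypothetical helper.

```python
import math


def chance_level_loss(num_negatives, num_terms=2):
    """Cross-entropy of a uniform guess over 1 positive + num_negatives keys,
    summed over num_terms symmetric loss terms.

    num_negatives=4096 and num_terms=2 are assumed values, not taken
    from this thread.
    """
    return num_terms * math.log(num_negatives + 1)


baseline = chance_level_loss(4096)
print(round(baseline, 2))
```

Under those assumptions the baseline is about 16.64, so a loss hovering around 16 would sit just below chance level on this measure. The opening post nevertheless reports that downstream results were reproduced from such a run.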

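How can I avoid using apex mixed precision? When I train with Swin Transformer, the loss drops and converges. However, I noticed that the Swin Transformer project does not use apex mixed

Has your problem been solved?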

YohjiNtpu avatar Mar 27 '23 06:03 YohjiNtpu

Is it normal for the loss value to be around 16? Has anyone encountered this problem?

Me too.

YohjiNtpu avatar Mar 27 '23 06:03 YohjiNtpu

Excuse me, have you solved the problem that loss drops to 8.9 and then rises in the opposite direction? Is it caused by apex mixed precision training?

Pang-b0 avatar Apr 03 '23 09:04 Pang-b0

Excuse me, has the problem of the loss dropping to 8.9 and then rising again been solved? Is it caused by apex mixed precision training?

No /(ㄒoㄒ)/~~

YohjiNtpu avatar Apr 03 '23 09:04 YohjiNtpu

Could it be a problem with the loss function? Are you still following this code? My loss has been 16 from the start and will not go down.

Pang-b0 avatar Apr 03 '23 15:04 Pang-b0

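Could it be a problem with the loss function? Are you still following this code? My loss has been 16 from the start and will not go down

I have not solved it either...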

YohjiNtpu avatar Apr 05 '23 02:04 YohjiNtpu