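During distributed training, the loss becomes exactly the same value at every logged step after the first few steps. What could be the cause?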
INFO:tensorflow:loss = 110098360.0, mrr = 0.40143234, step = 267 (2.804 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 494 (2.631 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 712 (2.521 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 941 (2.626 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 1162 (2.563 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 1378 (2.481 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 1601 (2.502 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 1822 (2.501 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 2053 (2.539 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 2273 (2.489 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 2490 (2.448 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 2798 (2.735 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 3200 (2.809 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 3603 (2.766 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 4015 (2.896 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 4423 (2.844 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 4835 (2.836 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 5245 (2.868 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 5641 (2.788 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 6048 (2.842 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 6452 (2.787 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 6849 (2.749 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 7242 (2.718 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 7644 (2.756 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 8039 (2.743 sec)
INFO:tensorflow:loss = 2129.3452, mrr = 0.16666669, step = 8430 (2.666 sec)
I've run into the same problem. How did you solve it?
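One possible cause is a sampling failure that makes the TensorFlow op concatenate a large number of default_value entries internally. If most of every batch is the same default padding, the model would see nearly identical inputs at each step, which would be consistent with the loss and mrr staying constant.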
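One way to check the sampling hypothesis is to measure how much of a sampled batch consists of the default value. The following is a minimal sketch, not the library's own diagnostic: the helper name `default_fraction`, the choice of `0` as `default_value`, and the example batch are all assumptions for illustration. In practice you would fetch the real sampled ids from your input pipeline and pass in whatever default your sampling op is configured with.

```python
import numpy as np


def default_fraction(sampled_ids, default_value=0):
    """Return the fraction of entries equal to default_value.

    Hypothetical helper: `sampled_ids` stands for the ids produced by
    the sampling op for one batch, and `default_value` is assumed to be
    the padding value that op falls back to when sampling fails.
    """
    ids = np.asarray(sampled_ids)
    if ids.size == 0:
        return 0.0
    return float(np.mean(ids == default_value))


# Made-up batch for illustration: almost every sampled id is the default.
batch = np.array([[0, 0, 0, 0],
                  [0, 0, 7, 0]])
frac = default_fraction(batch, default_value=0)
print(f"{frac:.2%} of sampled ids equal default_value")
```

If this fraction stays close to 100% across steps, the sampler is probably returning padding rather than real neighbors, and the graph service or data loading is worth inspecting before the model itself.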