Why does distributed training report "Variables not initialized"?
Hello, a question: we are running distributed training (1 PS + 2 workers) with the PPI dataset model from the official site. After worker-0 starts, it begins computing immediately, as shown below:
==> /tmp/log.worker.0 <==
INFO:tensorflow:f1 = 0.40184546, loss = 0.69358194, step = 0
INFO:tensorflow:f1 = 0.4026365, loss = 0.5608437, step = 20 (1.201 sec)
INFO:tensorflow:f1 = 0.4041432, loss = 0.53496814, step = 40 (1.069 sec)
INFO:tensorflow:f1 = 0.4073638, loss = 0.5300987, step = 60 (1.115 sec)
However, after worker-1 starts, it waits for a while before it begins computing. Its log is as follows:
INFO:tensorflow:Graph was finalized.
2019-04-23 03:45:56.103402: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 27c7c5a9b0b92ad1 with config: gpu_options { allow_growth: true }
INFO:tensorflow:Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: supervisedgraphsage_1/sageencoder_1/meanaggregator_1/dense_2/kernel, supervisedgraphsage_1/sageencoder_1/meanaggregator_1/dense_3/kernel, supervisedgraphsage_1/sageencoder_1/meanaggregator_2/dense_4/kernel, supervisedgraphsage_1/sageencoder_1/meanaggregator_2/dense_5/kernel, supervisedgraphsage_1/dense_1/kernel, supervisedgraphsage_1/dense_1/bias, global_step, beta1_power, beta2_power, supervisedgraphsage_1/sageencoder_1/meanaggregator_1/dense_2/kernel/Adam, supervisedgraphsage_1/sageencoder_1/meanaggregator_1/dense_2/kernel/Adam_1, supervisedgraphsage_1/sageencoder_1/meanaggregator_1/dense_3/kernel/Adam, supervisedgraphsage_1/sageencoder_1/meanaggregator_1/dense_3/kernel/Adam_1, supervisedgraphsage_1/sageencoder_1/meanaggregator_2/dense_4/kernel/Adam, supervisedgraphsage_1/sageencoder_1/meanaggregator_2/dense_4/kernel/Adam_1, supervisedgraphsage_1/sageencoder_1/meanaggregator_2/dense_5/kernel/Adam, supervisedgraphsage_1/sageencoder_1/meanaggregator_2/dense_5/kernel/Adam_1, supervisedgraphsage_1/dense_1/kernel/Adam, supervisedgraphsage_1/dense_1/kernel/Adam_1, supervisedgraphsage_1/dense_1/bias/Adam, supervisedgraphsage_1/dense_1/bias/Adam_1, num_finished_workers, ready: None
From the log, worker-1 appears to be waiting for the model variables to be initialized:
INFO:tensorflow:Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized
How can we resolve this?
Same question here.
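On the "Waiting for model to be ready" message: one plausible reading, assuming the training script uses a TF 1.x session helper such as tf.train.MonitoredTrainingSession (the post does not confirm this), is that only the chief worker runs the variable initializers, while every other worker polls the shared variables and logs the ones that are still uninitialized until the chief has finished. If so, the message on worker-1 would be an expected wait rather than an error. The sketch below is a toy model in plain Python, not TensorFlow code; ParameterServer and wait_until_ready are hypothetical names made up for illustration.

```python
import threading
import time


class ParameterServer:
    """Toy stand-in for a PS: stores variables, None means uninitialized."""

    def __init__(self, names):
        self._lock = threading.Lock()
        self._values = {name: None for name in names}

    def initialize_all(self):
        # What the chief worker does once at startup in this toy model.
        with self._lock:
            for name in self._values:
                self._values[name] = 0.0

    def uninitialized(self):
        # Names of variables that nobody has initialized yet.
        with self._lock:
            return [n for n, v in self._values.items() if v is None]


def wait_until_ready(ps, poll_secs, max_polls):
    """Non-chief behaviour in this toy model: poll until nothing is missing.

    Returns (ready, log), where log holds one line per unsuccessful poll.
    """
    log = []
    for _ in range(max_polls):
        missing = ps.uninitialized()
        if not missing:
            return True, log
        log.append("Variables not initialized: " + ", ".join(missing))
        time.sleep(poll_secs)
    return False, log


# Demo: the "chief" initializes shortly after the "worker" starts polling.
ps = ParameterServer(["dense_1/kernel", "dense_1/bias", "global_step"])
chief = threading.Timer(0.05, ps.initialize_all)
chief.start()
ready, log = wait_until_ready(ps, poll_secs=0.02, max_polls=100)
chief.join()
print("ready:", ready)
for line in log:
    print(line)
```

In this model the worker only notices the initialization at its next poll, so a long polling interval shows up as a pause before training starts even though the chief finished initializing much earlier. Whether that matches the delay seen on worker-1 depends on how the actual script creates its session.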