Sam
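Hello, and thank you for open-sourcing such a great framework; it is very convenient to use. However, I ran into a problem with distributed training of LINE: with 1 ps + 4 workers, the loss becomes very large. I am sure the data was split by partitions, and I verified the split to make sure it is correct. The training parameters are as follows:

nohup python -m tf_euler \
  --ps_hosts=xxx:1999 \
  --worker_hosts=xxx:2000,xxx:2000,xxx:2000,xxx:2000 \
  --job_name=worker \
  --task_index=0 \
  --data_dir hdfs:xxxxx/euler/test_data/ \
  --model_dir=hdfs:x/euler/LINE_embedding \
  --euler_zk_addr xxx:2181 \
  --euler_zk_path /test_embedding \
  --max_id 8428196 \
  --learning_rate...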
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that...
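Since the error points to a mismatch between the current graph and the checkpoint, one way to narrow it down might be to compare variable names and shapes on both sides. The following is only a minimal sketch, not part of tf_euler: `diff_variables` is a hypothetical helper, and the variable names and shapes in the example are made up for illustration. In TensorFlow, `tf.train.list_variables(model_dir)` returns `(name, shape)` pairs for a checkpoint, which could serve as the first argument.

```python
def diff_variables(checkpoint_vars, graph_vars):
    """Compare (name, shape) pairs from a checkpoint with those of the current graph.

    Returns three sorted lists:
      missing    - names in the graph but not in the checkpoint
      extra      - names in the checkpoint but not in the graph
      mismatched - (name, checkpoint_shape, graph_shape) where shapes differ
    """
    ckpt = {name: tuple(shape) for name, shape in checkpoint_vars}
    graph = {name: tuple(shape) for name, shape in graph_vars}

    missing = sorted(set(graph) - set(ckpt))
    extra = sorted(set(ckpt) - set(graph))
    mismatched = sorted(
        (name, ckpt[name], graph[name])
        for name in set(ckpt) & set(graph)
        if ckpt[name] != graph[name]
    )
    return missing, extra, mismatched


if __name__ == "__main__":
    # Hypothetical variable names and shapes, for illustration only.
    checkpoint_vars = [("embedding", [1000, 128]), ("global_step", [])]
    graph_vars = [("embedding", [2000, 128]), ("global_step", []), ("bias", [128])]

    missing, extra, mismatched = diff_variables(checkpoint_vars, graph_vars)
    print("missing from checkpoint:", missing)
    print("not in current graph:", extra)
    print("shape mismatches:", mismatched)
```

If a shape mismatch shows up, a plausible cause is that `--model_dir` still holds a checkpoint written by an earlier run with different parameters; that is an assumption to check, not a confirmed diagnosis.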