Sam issues

Results: 2 issues by Sam

Abnormal loss in distributed LINE training

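Hello, and thank you for open-sourcing such a good framework; it is very convenient to use. However, I ran into a problem with distributed training of LINE: with distributed training (1 ps + 4 workers), the loss becomes very large. I am sure the data was split by partitions, and I verified the split to make sure it is correct. The training parameters are as follows:

    nohup python -m tf_euler \
      --ps_hosts=xxx:1999 \
      --worker_hosts=xxx:2000,xxx:2000,xxx:2000,xxx:2000 \
      --job_name=worker \
      --task_index=0 \
      --data_dir hdfs:xxxxx/euler/test_data/ \
      --model_dir=hdfs:x/euler/LINE_embedding \
      --euler_zk_addr xxx:2181 \
      --euler_zk_path /test_embedding \
      --max_id 8428196 \
      --learning_rate...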

Hello, after finishing distributed training, I get the following error when exporting the model. How can I resolve it?

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that...