linear local mode error: JUST_A_UNKNOWN_NODE is disconnected

Open alaleiwang opened this issue 8 years ago • 0 comments

The command is:

repo/dmlc-core/tracker/dmlc-submit --cluster local --env DMLC_CPU_VCORES=1 --env DMLC_MEMORY_MB=512 --num-workers 2 --num-servers 1 --worker-cores 1 --server-cores 1 learn/linear/build/linear.dmlc learn/linear/guide/demo.conf

The client shows this error:

Connected 1 servers and 2 workers
Training: iter = 0
sec ttl #ex inc #ex |w|_0 logloss accuracy AUC
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/alalei/wormhole_new2/repo/dmlc-core/tracker/dmlc_tracker/local.py", line 45, in exec_cmd
    raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/alalei/wormhole_new2/repo/dmlc-core/tracker/dmlc_tracker/local.py", line 45, in exec_cmd
    raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11
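A note on reading the return code: assuming the tracker gets `ret` from Python's `subprocess` module on a POSIX system, a negative value `-N` means the child process was terminated by signal `N`. Under that assumption, `-11` would correspond to signal 11 (SIGSEGV on Linux), which suggests the launched processes crashed rather than exited with an error status. The helper below is a hypothetical sketch for decoding such codes, not part of dmlc-core, and it uses `signal.Signals`, which needs Python 3.5+ (the traceback above is from Python 2.7).

```python
import signal


def describe_return_code(ret):
    """Describe a child return code using the POSIX subprocess convention.

    Hypothetical helper for illustration; not part of the dmlc tracker.
    """
    if ret == 0:
        return "exited normally"
    if ret > 0:
        return "exited with status %d" % ret
    # A negative code -N means the child was terminated by signal N.
    signum = -ret
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return "killed by signal %d (%s)" % (signum, name)


if __name__ == "__main__":
    # The code reported in the traceback above.
    print(describe_return_code(-11))
```

If the decoding holds, the next step would be to run `learn/linear/build/linear.dmlc` under a debugger or with core dumps enabled to find where the segmentation fault happens.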

/tmp/linear.dmlc.H.log.INFO.20170918-133532.8524 shows:

I0918 13:35:32.054481 8524 van.cc:30] I'm [role: SCHEDULER id: "H" hostname: "10.2.177.240" port: 9092]
I0918 13:35:32.057025 8524 manager.cc:34] Staring system. Logging into /tmp/linear.dmlc.log.*
I0918 13:35:32.068665 8557 workload_pool.h:168] assign W_10.2.177.240_52648 job learn/data/agaricus.txt.train 0 / 10. 1 #jobs on processing.
I0918 13:35:32.068797 8557 workload_pool.h:168] assign W_10.2.177.240_43323 job learn/data/agaricus.txt.train 1 / 10. 2 #jobs on processing.
I0918 13:35:32.125037 8556 manager.cc:275] JUST_A_UNKNOWN_NODE is disconnected

/tmp/linear.dmlc.W_10.2.177.240_43323.log.INFO.20170918-133532.8525 shows:

I0918 13:35:32.048825 8525 van.cc:30] I'm [role: WORKER id: "W_10.2.177.240_43323" hostname: "10.2.177.240" port: 43323]
I0918 13:35:32.069018 8551 minibatch_solver.h:291] iter = 0, training, learn/data/agaricus.txt.train 1 / 10, minibatch = 1000, concurrency = 2, shuffle ratio = 10000, negative sampling =

/tmp/linear.dmlc.S_10.2.177.240_40067.log.INFO.20170918-133532.8529 shows:

I0918 13:35:32.052947 8529 van.cc:30] I'm [role: SERVER id: "S_10.2.177.240_40067" hostname: "10.2.177.240" port: 40067]

alaleiwang, Sep 18 '17 05:09