[Problem] Training step
Hi,
I followed the instructions to run the training (python train.py) with the default setting max_epoch=500.
At the end of epoch 499, the following error appears:
max_epoch 500
**** EPOCH 499 ****
2019-01-22 17:23:39.480862
Progress: [##########] 100%mean loss: 0.062824
Overall accuracy : 0.993542
Average IoU : 0.966070
IoU of man-made terrain : 0.978290
IoU of natural terrain : 0.991271
IoU of high vegetation : 0.995123
IoU of low vegetation : 0.932481
IoU of buildings : 0.994296
IoU of hard scape : 0.950104
IoU of scanning artifact : 0.926501
IoU of cars : 0.960493
(tf) william@william-Ubuntu:/media/william/E/Open3D-PointNet2-Semantic3D$ Process ForkPoolWorker-1:1:
Traceback (most recent call last):
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/pool.py", line 125, in worker
put((job, i, result))
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/queues.py", line 347, in put
self._writer.send_bytes(obj)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 397, in _send_bytes
self._send(header)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/pool.py", line 130, in worker
put((job, i, (False, wrapped)))
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/queues.py", line 347, in put
self._writer.send_bytes(obj)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-1:5:
Traceback (most recent call last):
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/pool.py", line 125, in worker
put((job, i, result))
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/queues.py", line 347, in put
self._writer.send_bytes(obj)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 397, in _send_bytes
self._send(header)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/pool.py", line 130, in worker
put((job, i, (False, wrapped)))
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/queues.py", line 347, in put
self._writer.send_bytes(obj)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/media/william/E/anaconda3/envs/tf/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Could anyone please advise on what might be going wrong? Thanks. William
Looks like the multiprocessing Pool or Queue worker processes for dataset pre-fetching are not properly terminated at the end of the training. Luckily, this won't affect the training results, as it only happens at the end when the training is done. Will need to fix by properly terminating the worker processes.
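For reference, a minimal sketch of what properly terminating the workers could look like, assuming the pre-fetching is built on a multiprocessing Pool. The load_batch helper, the run_training function, and the batch counts are hypothetical stand-ins, not the repository's code; the fork start method is chosen only because the traceback shows ForkPoolWorker processes.

```python
import multiprocessing as mp


def load_batch(index):
    # Hypothetical stand-in for the dataset pre-fetching work.
    return [index] * 4


def run_training(num_batches=8, num_workers=2):
    # Assumption: fork start method, matching the ForkPoolWorker names
    # in the traceback (Unix only).
    pool = mp.get_context("fork").Pool(processes=num_workers)
    try:
        # Queue the pre-fetch jobs, then consume every result.
        pending = [pool.apply_async(load_batch, (i,)) for i in range(num_batches)]
        batches = [job.get() for job in pending]
    finally:
        # close() stops new submissions; join() waits for the workers to
        # exit, so none is still writing to the result pipe at shutdown.
        pool.close()
        pool.join()
    return batches


if __name__ == "__main__":
    print(len(run_training()))
```

The key point is the close() and join() pair in the finally block: the parent does not exit while a worker still holds a result to send, which is the situation the BrokenPipeError above suggests.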
Thanks for your kind reply. Yes, I noticed it won't affect the training results, as each best model has been saved. I would like to ask one more thing: if the training process is interrupted, say at epoch 324, is there any flag or parameter to make train.py resume training from epoch 324, or from the last saved model?
Thanks!
Can you tell me which TensorFlow version you used?
Hi @yxlao @yulongyu, I am also having the same issue. Is there any update on it?
Also, regarding the question "can you tell me which tensorflow version you used?": I am using tensorflow-gpu version 1.12.0.