Worker should stop a task and report failure to the master if a mini-batch of the task fails.
A mini-batch may fail while the worker is executing the mini-batches of a task, but the worker continues, reports err_msg to the master, and then reports the task as done after completing all of its mini-batches. Instead, the worker should stop executing the task and report the failure to the master. The master then puts the task back in the todo queue.
Please refer to #1253. Maybe we could just let it fail. There is also no need to keep the task status.
This issue is different from #1253. It describes a bug: a worker may tell the master that task T finished successfully even if some mini-batches in task T failed.
This is because the function report_record_done ignores err_msg unless the mini-batch is the last one in the current task. Please refer to
https://github.com/sql-machine-learning/elasticdl/blob/54554153748eb74c2aef907bdb8f517858547c93/elasticdl/python/worker/task_data_service.py#L40-L64
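To make the reported behavior concrete, here is a minimal, self-contained sketch. It is not the ElasticDL code: `FakeMaster` and `BuggyTaskDataService` are hypothetical stand-ins, and only the names `report_record_done`, `err_msg`, `_reported_record_count` and `_pending_tasks_with_counts` are taken from this thread. It assumes the error message is forwarded only on the call that completes the task.

```python
class FakeMaster:
    """Hypothetical stand-in for the master; records what workers report."""

    def __init__(self):
        self.results = []

    def report_task_result(self, task_id, err_msg):
        self.results.append((task_id, err_msg))


class BuggyTaskDataService:
    """Simplified model of the behavior described in this issue."""

    def __init__(self, master, task_id, record_count):
        self._master = master
        # Each entry is (task_id, number of records in the task).
        self._pending_tasks_with_counts = [(task_id, record_count)]
        self._reported_record_count = 0

    def report_record_done(self, count, err_msg=""):
        self._reported_record_count += count
        if not self._pending_tasks_with_counts:
            return False
        task_id, total = self._pending_tasks_with_counts[0]
        if self._reported_record_count < total:
            # Not the last mini-batch: err_msg is silently dropped here.
            return False
        # Only the err_msg of the final mini-batch reaches the master.
        self._master.report_task_result(task_id, err_msg)
        self._pending_tasks_with_counts.pop(0)
        self._reported_record_count -= total
        return True


master = FakeMaster()
service = BuggyTaskDataService(master, task_id=1, record_count=30)
service.report_record_done(10, err_msg="mini-batch 0 failed")
service.report_record_done(10)
service.report_record_done(10)
print(master.results)  # [(1, '')]
```

Under these assumptions the master receives an empty message for task 1, so it treats the task as successful even though the first mini-batch failed.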
@workingloong Any update on this issue? If not, I can try to fix the problem.
No, not yet.
After further investigation of the code base, it seems that the length of err_msg is used to indicate the status of a mini-batch. If a worker reports a non-empty message to the master, the task dispatched to that worker is treated as failed, and TaskDispatcher adds the task back to the todo queue.
TaskDataService of a worker just reports the message back to master.
I will close this issue. If any problem exists, please reopen it and ping me.
TaskDataService only reports the message when the mini-batch is the last one in the task. The err_msg of any other mini-batch is ignored, because self._reported_record_count < self._pending_tasks_with_counts[0][1] at that point.
https://github.com/sql-machine-learning/elasticdl/blob/54554153748eb74c2aef907bdb8f517858547c93/elasticdl/python/worker/task_data_service.py#L40-L64
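One possible way to avoid dropping those messages is to remember every non-empty err_msg until the task is reported. The sketch below is only an illustration under that assumption, not the fix adopted in ElasticDL: `FakeMaster`, `TaskDataServiceSketch` and `_task_err_msgs` are hypothetical names. It also does not stop the task early, it only makes sure the failure reaches the master.

```python
class FakeMaster:
    """Hypothetical stand-in for the master; records what workers report."""

    def __init__(self):
        self.results = []

    def report_task_result(self, task_id, err_msg):
        self.results.append((task_id, err_msg))


class TaskDataServiceSketch:
    """Simplified model that keeps err_msg from every mini-batch of a task."""

    def __init__(self, master, task_id, record_count):
        self._master = master
        # Each entry is (task_id, number of records in the task).
        self._pending_tasks_with_counts = [(task_id, record_count)]
        self._reported_record_count = 0
        # Hypothetical field: errors seen so far in the current task.
        self._task_err_msgs = []

    def report_record_done(self, count, err_msg=""):
        self._reported_record_count += count
        if not self._pending_tasks_with_counts:
            return False
        if err_msg:
            # Remember the failure even if this is not the last mini-batch.
            self._task_err_msgs.append(err_msg)
        task_id, total = self._pending_tasks_with_counts[0]
        if self._reported_record_count < total:
            return False
        # A non-empty message tells the master that the task failed.
        self._master.report_task_result(task_id, "; ".join(self._task_err_msgs))
        self._task_err_msgs = []
        self._pending_tasks_with_counts.pop(0)
        self._reported_record_count -= total
        return True


master = FakeMaster()
service = TaskDataServiceSketch(master, task_id=1, record_count=30)
service.report_record_done(10, err_msg="mini-batch 0 failed")
service.report_record_done(10)
service.report_record_done(10)
print(master.results)  # [(1, 'mini-batch 0 failed')]
```

With the message preserved, the master sees a non-empty err_msg for task 1 and can put the task back in the todo queue, as described earlier in this thread.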