Worker should stop the task and report the failure to the master if a mini-batch of the task fails.

Open workingloong opened this issue 6 years ago • 6 comments

A mini-batch may fail while the worker is executing the mini-batches of a task, but the worker continues, reports err_msg to the master, and then reports the task as done after completing all mini-batches of the task. Instead, the worker should stop executing the task and report the failure to the master. The master then puts the task back in the todo queue.

workingloong • Sep 26 '19 02:09

Please refer to #1253. Maybe we could just let it fail. There is also no need to keep the task status.

QiJune • Sep 27 '19 00:09

> Please refer to #1253. Maybe we could just let it fail. There is also no need to keep the task status.

This issue is different from #1253. This issue describes a bug: a worker may tell the master that task T finished successfully even if some mini-batches in task T failed.

This happens because the function report_record_done ignores err_msg unless the mini-batch is the last one in the current task. Please refer to:

https://github.com/sql-machine-learning/elasticdl/blob/54554153748eb74c2aef907bdb8f517858547c93/elasticdl/python/worker/task_data_service.py#L40-L64
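The behavior described above can be illustrated with a minimal sketch. This is not the code behind the link: `FakeTaskDataService`, `add_task`, and the `reports` list are hypothetical stand-ins, and only the idea that err_msg is looked at solely when the last mini-batch of a task is reported is taken from the discussion.

```python
from collections import deque


class FakeTaskDataService:
    """Hypothetical, simplified stand-in for the worker-side service."""

    def __init__(self):
        # Each entry is (task_id, number of records in the task).
        self._pending_tasks_with_counts = deque()
        self._reported_record_count = 0
        # (task_id, err_msg) pairs that would be sent to the master.
        self.reports = []

    def add_task(self, task_id, record_count):
        self._pending_tasks_with_counts.append((task_id, record_count))

    def report_record_done(self, count, err_msg=""):
        self._reported_record_count += count
        task_id, needed = self._pending_tasks_with_counts[0]
        if self._reported_record_count < needed:
            # Not the last mini-batch: err_msg is silently dropped here.
            return
        self._pending_tasks_with_counts.popleft()
        self._reported_record_count -= needed
        self.reports.append((task_id, err_msg))


service = FakeTaskDataService()
service.add_task(task_id=1, record_count=30)
service.report_record_done(10, err_msg="mini-batch 0 failed")
service.report_record_done(10)
service.report_record_done(10)
print(service.reports)  # [(1, '')]
```

In this sketch the first mini-batch fails, yet the only report that reaches the master carries an empty err_msg, so the task looks successful.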

mhaoli • Sep 29 '19 03:09

@workingloong Any update on this issue? If not, I can try to fix the problem.

chunyang-wen • Oct 12 '19 08:10

> @workingloong Any update on this issue? If not, I can try to fix the problem.

No update yet.

workingloong • Oct 15 '19 03:10

After further investigation of the code base, it seems that the length of err_msg is used to indicate the status of a mini-batch. If a worker reports a non-empty message to the master, it means the task dispatched to that worker has failed, and TaskDispatcher will add the task back to the todo queue.

The TaskDataService of a worker just reports the message back to the master.
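The master-side rule described above (a non-empty err_msg means the task failed and goes back to the todo queue) can be sketched as follows. `FakeTaskDispatcher` and its `get` and `report` methods are hypothetical names chosen for illustration; they are not the real TaskDispatcher API.

```python
from collections import deque


class FakeTaskDispatcher:
    """Hypothetical, simplified stand-in for the master-side dispatcher."""

    def __init__(self, task_ids):
        self.todo = deque(task_ids)  # tasks waiting to be dispatched
        self.doing = {}              # task_id -> worker_id
        self.done = []               # task_ids that finished successfully

    def get(self, worker_id):
        task_id = self.todo.popleft()
        self.doing[task_id] = worker_id
        return task_id

    def report(self, task_id, err_msg=""):
        self.doing.pop(task_id)
        if err_msg:
            # A non-empty message marks the task as failed: re-queue it.
            self.todo.append(task_id)
        else:
            self.done.append(task_id)


dispatcher = FakeTaskDispatcher([1, 2])
first = dispatcher.get(worker_id=0)
dispatcher.report(first, err_msg="mini-batch failed")
second = dispatcher.get(worker_id=0)
dispatcher.report(second)
print(list(dispatcher.todo), dispatcher.done)  # [1] [2]
```

The re-queue only happens if the error message actually reaches the master, so this rule relies on the worker forwarding every failure.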

I will close this issue. If any problem exists, please reopen it and ping me.

chunyang-wen • Oct 28 '19 05:10

> After further investigation of the code base, it seems that the length of err_msg is used to indicate the status of a mini-batch. If a worker reports a non-empty message to the master, it means the task dispatched to that worker has failed, and TaskDispatcher will add the task back to the todo queue.
>
> The TaskDataService of a worker just reports the message back to the master.
>
> I will close this issue. If any problem exists, please reopen it and ping me.

TaskDataService only reports the message when the mini-batch is the last one in the task. The err_msg of every other mini-batch is ignored, because for those mini-batches self._reported_record_count < self._pending_tasks_with_counts[0][1].

https://github.com/sql-machine-learning/elasticdl/blob/54554153748eb74c2aef907bdb8f517858547c93/elasticdl/python/worker/task_data_service.py#L40-L64
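One possible way to avoid losing the error, sketched under assumptions: remember the first non-empty err_msg seen for the current task and send it when the task is reported. `ErrorKeepingTaskDataService`, `add_task`, `_current_task_err_msg`, and `reports` are hypothetical names, and this is not necessarily how the project resolved the issue. Stopping the task early, as the issue title asks, would additionally require skipping the remaining mini-batches, which is not shown here.

```python
from collections import deque


class ErrorKeepingTaskDataService:
    """Hypothetical sketch: keep the first error of a task until it is reported."""

    def __init__(self):
        # Each entry is (task_id, number of records in the task).
        self._pending_tasks_with_counts = deque()
        self._reported_record_count = 0
        self._current_task_err_msg = ""
        # (task_id, err_msg) pairs that would be sent to the master.
        self.reports = []

    def add_task(self, task_id, record_count):
        self._pending_tasks_with_counts.append((task_id, record_count))

    def report_record_done(self, count, err_msg=""):
        self._reported_record_count += count
        if err_msg and not self._current_task_err_msg:
            # Remember the first failure instead of dropping it.
            self._current_task_err_msg = err_msg
        task_id, needed = self._pending_tasks_with_counts[0]
        if self._reported_record_count < needed:
            return
        self._pending_tasks_with_counts.popleft()
        self._reported_record_count -= needed
        self.reports.append((task_id, self._current_task_err_msg))
        # Reset so the next task starts with a clean status.
        self._current_task_err_msg = ""


fixed = ErrorKeepingTaskDataService()
fixed.add_task(task_id=1, record_count=30)
fixed.add_task(task_id=2, record_count=10)
fixed.report_record_done(10, err_msg="mini-batch 0 failed")
fixed.report_record_done(10)
fixed.report_record_done(10)
fixed.report_record_done(10)
print(fixed.reports)  # [(1, 'mini-batch 0 failed'), (2, '')]
```

With this change the master would see a non-empty err_msg for task 1 and could put it back in the todo queue, while task 2 is still reported as successful.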

workingloong • Oct 28 '19 06:10