chen
We also encountered this problem. Our setup: dataset 60G, executors 128, memory 16G, executor cores 4. Is there a good solution now?
Okay, thanks for your reply.
Hello @svotaw, I solved the problem by setting the parameter useBarrierExecutionMode=True, but that confuses me even more. I looked at where the barrier works and can...
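For anyone who wants to try the same workaround, here is a minimal sketch of the settings mentioned in this thread. The helper name `lightgbm_settings` is hypothetical, and the commented-out usage assumes SynapseML's `LightGBMClassifier` accepts `useBarrierExecutionMode` and `numTasks` as keyword arguments in your version, so please check that against your installation.

```python
# Hypothetical helper (not part of SynapseML): gathers the two settings
# discussed in this thread into one dict of keyword arguments.
def lightgbm_settings(num_tasks: int, use_barrier: bool = True) -> dict:
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return {
        "useBarrierExecutionMode": use_barrier,  # the workaround reported above
        "numTasks": num_tasks,                   # number of LightGBM training tasks
    }


settings = lightgbm_settings(num_tasks=128)

# Assumed usage on a cluster with SynapseML installed (not executed here):
# from synapseml.lightgbm import LightGBMClassifier
# model = LightGBMClassifier(**settings).fit(train_df)  # train_df is a placeholder
```

Keeping the settings in a plain dict makes it easy to switch barrier mode on and off between runs while comparing behaviour.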
The version we are using is the latest, 0.10.1. Thanks for your reply. Hello @imatiach-msft, can you help me understand why this problem occurs?
I think I found the code that is the root cause of this error, @svotaw @imatiach-msft. If the number of tasks requested is inconsistent with the number of...
Maybe there could be a strategy here to skip the check logic and set numTasks to the number of workers actually obtained. Strategy: when a certain ratio of workers are connected...
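To make the proposal concrete, here is a rough sketch of one way such a ratio check could look. This is only an illustration of the idea, not the existing driver code: the names `WorkerDecision`, `decide_num_tasks` and `min_ratio`, and the default threshold of 0.9, are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class WorkerDecision:
    proceed: bool   # whether training may start with the workers seen so far
    num_tasks: int  # the task count the driver would use


def decide_num_tasks(expected: int, connected: int, min_ratio: float = 0.9) -> WorkerDecision:
    """Hypothetical policy: start once enough of the expected workers have connected."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    if connected < 0:
        raise ValueError("connected must not be negative")
    if not 0.0 < min_ratio <= 1.0:
        raise ValueError("min_ratio must be in (0, 1]")
    if connected / expected >= min_ratio:
        # Enough workers: shrink numTasks to what actually connected.
        return WorkerDecision(proceed=True, num_tasks=connected)
    # Too few workers: keep the requested count and keep waiting (or fail).
    return WorkerDecision(proceed=False, num_tasks=expected)
```

With these assumed defaults, 126 of 128 workers connecting would let training go ahead with numTasks set to 126, while 64 of 128 would not.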
Sorry for the late reply. In fact, I am not sure why numTasks does not match the actual number of tasks. While the driver is waiting for accept, will a task failure...
Even when numTasks is 512, it runs successfully in most cases.
I think I reproduced the problem without barrier execution mode. Setup: dataset 40G, executors 64, memory 16G, executor cores 2, numTasks 128. I have checked all the 128 tasks log...
The abnormal executor ID is 65. And it can be seen from the job graph that it was caused by node 14 being removed in the middle and restarting a new...
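A quick check of the numbers in this reproduction shows why losing even one executor matters. This assumes one task per core (Spark's default `spark.task.cpus=1`) and, purely for illustration, that the removed node hosted a single executor. The helper name `task_slots` is made up for this sketch.

```python
def task_slots(executors: int, cores_per_executor: int, cpus_per_task: int = 1) -> int:
    """Number of tasks that can run at the same time on the cluster."""
    return executors * (cores_per_executor // cpus_per_task)


num_tasks = 128

slots = task_slots(executors=64, cores_per_executor=2)
print(slots, slots - num_tasks)  # 128 0  -> no spare slots at all

# Assumption: one executor disappears while the tasks are still connecting.
slots_after_loss = task_slots(executors=63, cores_per_executor=2)
print(slots_after_loss, slots_after_loss - num_tasks)  # 126 -2  -> two tasks cannot run at once
```

With numTasks equal to the total slot count there is no headroom, so any executor that is removed and replaced mid-job changes which tasks the driver actually sees.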