chen

15 comments by chen

We also encountered this problem. Our setup: dataset 60 GB, executors 128, memory 16 GB, executor cores 4. Is there a good solution yet?

Hello @svotaw, I solved the problem by setting useBarrierExecutionMode=True, but that confuses me even more. ![Screenshot 2022-11-05 at 11.45.38 PM](https://user-images.githubusercontent.com/16252583/200128254-79671442-68a8-464b-a4f0-121f6de4de41.png) I looked at where the barrier works and can...

We are using the latest version, 0.10.1. Thanks for your reply. Hello @imatiach-msft, can you help me understand why this problem occurs?

![Screenshot 2022-11-07 at 4.41.19 PM](https://user-images.githubusercontent.com/16252583/200264737-ace11ece-6fae-4077-9232-c4586d382465.png) I think I found the code that causes this error, @svotaw @imatiach-msft. If the number of tasks requested is inconsistent with the number of...
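The check described in this comment can be illustrated with a minimal sketch. It is not the SynapseML source (the screenshot is not reproduced here); it only assumes that the driver compares the number of workers that connected against the requested numTasks and aborts on any difference. `strict_check` and `WorkerCountMismatch` are hypothetical names.

```python
class WorkerCountMismatch(RuntimeError):
    """Hypothetical error: connected workers differ from requested tasks."""


def strict_check(requested_tasks: int, connected_workers: int) -> int:
    # Assumed behavior: proceed only if every requested task connected;
    # any difference, fewer or more, aborts the run.
    if connected_workers != requested_tasks:
        raise WorkerCountMismatch(
            f"expected {requested_tasks} workers, got {connected_workers}"
        )
    return requested_tasks


print(strict_check(128, 128))  # 128
try:
    strict_check(128, 127)
except WorkerCountMismatch as err:
    print(err)  # expected 128 workers, got 127
```

With a strict equality check like this, a single lost or duplicated worker connection is enough to fail the whole run.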

Maybe there could be a strategy here that skips the check and sets numTasks to the number of workers actually obtained. Strategy: when a certain ratio of workers are connected...
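The proposal is cut off in this excerpt, so the rule sketched here is only one possible reading: wait up to a timeout, then train with however many workers connected, provided they reach a minimum fraction of the requested numTasks. `choose_num_tasks`, `min_ratio`, and `timeout` are hypothetical names, not SynapseML parameters.

```python
from typing import Sequence


def choose_num_tasks(
    arrival_times: Sequence[float],
    requested_tasks: int,
    min_ratio: float,
    timeout: float,
) -> int:
    """Hypothetical ratio-based rule for picking the effective numTasks.

    arrival_times: seconds after the driver starts listening at which each
    worker connected. Workers arriving after `timeout` are ignored.
    """
    if not 0.0 < min_ratio <= 1.0:
        raise ValueError("min_ratio must be in (0, 1]")
    # Count only workers that connected before the driver stopped waiting.
    connected = sum(1 for t in arrival_times if t <= timeout)
    # Everyone showed up: keep the requested task count.
    if connected >= requested_tasks:
        return requested_tasks
    # Partial set: accept it only if it reaches the ratio threshold.
    if connected >= min_ratio * requested_tasks:
        return connected
    raise RuntimeError(
        f"only {connected}/{requested_tasks} workers connected before timeout"
    )


# 120 of 128 workers connect in time; 8 arrive after the 300 s timeout.
arrivals = [1.0] * 120 + [400.0] * 8
print(choose_num_tasks(arrivals, requested_tasks=128, min_ratio=0.9, timeout=300.0))  # 120
```

Training on fewer workers than partitions would also leave the data held by the missing partitions unused unless it is redistributed; the sketch ignores that.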

Sorry for replying so late. In fact, I am not sure why numTasks does not match the actual number of tasks. While the driver is waiting to accept connections, will a task failure...

Even when numTasks is 512, it runs successfully most of the time.

I think I reproduced the problem without barrier execution mode. Setup: dataset 40 GB, executors 64, memory 16 GB, executor cores 2, numTasks 128. I have checked the logs of all 128 tasks...
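A quick slot calculation puts the two reported setups in context. It assumes `spark.task.cpus` is left at its default of 1, so each executor core runs one task at a time; `task_slots` is a hypothetical helper.

```python
def task_slots(executors: int, cores_per_executor: int, task_cpus: int = 1) -> int:
    # Concurrent task slots, assuming spark.task.cpus == task_cpus (default 1).
    return executors * (cores_per_executor // task_cpus)


first_setup = task_slots(executors=128, cores_per_executor=4)
repro_setup = task_slots(executors=64, cores_per_executor=2)
print(first_setup, repro_setup)  # 512 128

# In the repro, numTasks = 128 fills every slot, so losing one executor
# leaves fewer slots than tasks until a replacement executor registers.
num_tasks = 128
after_loss = task_slots(executors=63, cores_per_executor=2)
print(after_loss, after_loss < num_tasks)  # 126 True
```

When numTasks equals the total slot count, there is no spare capacity: any executor loss forces at least one task to wait or be rescheduled while the driver is still expecting all of them.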

The abnormal executor ID is 65. The job graph shows that this was caused by node 14 being removed midway and restarting a new...
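One way a node removal could, in principle, make the driver's count disagree with numTasks is a task retried on a replacement executor connecting a second time. This hypothetical sketch shows that failure mode and one possible mitigation: key connections by partition ID and keep only the latest attempt. `Connection`, `partition_id`, and `attempt` are assumed names, not SynapseML's actual protocol, and the partition chosen in the example is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    partition_id: int  # hypothetical: Spark partition the task processes
    attempt: int       # hypothetical: task attempt number (0 = first try)
    host: str


def count_naive(conns: list[Connection]) -> int:
    # Counts every connection, so a retried task is counted twice.
    return len(conns)


def latest_per_partition(conns: list[Connection]) -> dict[int, Connection]:
    # Keeps only the newest attempt per partition, so a retry replaces
    # the earlier attempt instead of adding a second worker.
    latest: dict[int, Connection] = {}
    for c in conns:
        prev = latest.get(c.partition_id)
        if prev is None or c.attempt > prev.attempt:
            latest[c.partition_id] = c
    return latest


# 128 first attempts, then partition 5 is retried after its node is removed.
conns = [Connection(p, 0, "original-node") for p in range(128)]
conns.append(Connection(5, 1, "replacement-node"))
print(count_naive(conns))                     # 129
print(len(latest_per_partition(conns)))       # 128
print(latest_per_partition(conns)[5].host)    # replacement-node
```

A real fix would also have to close the stale connection and tell the other workers about the replacement, which this sketch does not model.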