Yu OuYang
> t_return_code, cmd=cmd > subprocess.CalledProcessError: Command '['/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'configs/t5_pp_pretrain.py', 'train.dist.tensor_parallel_size=8', 'train.dist.pipeline_parallel_size=1', 'train.dist.data_parallel_size=1', 'train.zero_optimization.enabled=True', 'train.zero_optimization.stage=3', 'train.log_period=1']' died with . > F20220301 16:13:39.021886 52842 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethod( &client_ctx, request,...
> > This looks like the previous connection has not been released yet. Could you run it again after a while? > > Same result, the error is still there. I suspect the multi-GPU failure may be what produced this output? I misread it; the actual error should be this one: ``` F20220301 08:31:15.972954 77402 exec_graph.cpp:117] File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/graph/exec_graph.cpp", line 117, in InferBlobDescs op_->InferBlobDescsIf(GetBlobDesc4BnInOp, parallel_ctx, &GlobalJobDesc()) File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/operator.cpp", line 324, in InferBlobDescsIf InferOutBlobDescsIf(GetBlobDesc4BnInOp,...
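@Ldpe2G Could you provide the official test scripts and container environment for this, as well as the dataset? I need them for testing on my side too. Also, is there any difference between the swin-transformer repository and the version in libai?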
## oneflow16+oneflow15 - [x] On the oneflow16 machine, single node, under normal conditions (IB service not stopped), init rdma ``` state: PORT_ACTIVE (4) ``` - [x] On the oneflow16 machine with the IB service stopped via `/etc/init.d/openibd stop`, the user enables init rdma ``` W20220920 10:44:56.579414 1919887 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable! W20220920 10:44:56.632112...
After stopping the service with `/etc/init.d/openibd stop`, an error is reported whether or not the user calls init rdma. ``` ibv_devinfo Failed to get IB devices list: Function not implemented ibstatus Fatal error: No devices /usr/sbin/ibstatus: 21: exit: Illegal number: -1 ``` - The nccl debug output follows: ```...
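I did some further testing on the issue above. After pinning the network interface with export NCCL_SOCKET_IFNAME=eno1 (although this IP is also the IB NIC's IP; the machine has no other IP): with IB stopped on one machine and left running on the other, the nccl log shows one machine reporting "Using network IB", while the machine whose IB driver is stopped hangs, and the multi-node run ends up stuck. With IB stopped on both machines and the interface specified, the run completes. I wonder whether this PR is unnecessary, because nccl uses IB by default on a machine where it is available. That said, the case where only one machine has IB stopped could be retried on other machines whose IB NIC has a separate IP.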
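For anyone reproducing the mixed IB/non-IB case, here is a minimal sketch of how the NCCL environment could be set identically on every node before launching. `NCCL_SOCKET_IFNAME`, `NCCL_IB_DISABLE` and `NCCL_DEBUG` are standard NCCL environment variables; the helper name `nccl_env` and its parameters are hypothetical and only for illustration. Forcing the socket transport with `NCCL_IB_DISABLE=1` was not tried in the tests above, so treat it as an assumption to verify rather than a confirmed fix.

```python
import os


def nccl_env(ifname, disable_ib=False, debug="INFO", base=None):
    """Return a copy of the environment with NCCL settings applied.

    ifname     -- interface name for NCCL_SOCKET_IFNAME, e.g. "eno1"
    disable_ib -- if True, set NCCL_IB_DISABLE=1 so NCCL skips the IB transport
    debug      -- value for NCCL_DEBUG (INFO prints the selected network)
    base       -- environment to start from; defaults to os.environ
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_SOCKET_IFNAME"] = ifname
    env["NCCL_DEBUG"] = debug
    if disable_ib:
        env["NCCL_IB_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    # Print only the NCCL-related variables that a launcher would receive.
    env = nccl_env("eno1", disable_ib=True)
    for key in sorted(k for k in env if k.startswith("NCCL_")):
        print(f"{key}={env[key]}")
```

The returned dict can be passed as `env=` to `subprocess.run` when starting the training process. The same settings need to be used on all nodes so that every rank picks the same transport.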
### 2-node correctness test  Correctness is fine, @lixinqi
### eager test - Machine: oneflow28 NVIDIA GeForce RTX 3080 Ti - Disk: ssd | Case | [check_xx@2a23745](https://github.com/Oneflow-Inc/oneflow/tree/check_xx) | [master@a3841f5](https://github.com/Oneflow-Inc/oneflow/commit/a3841f5) | | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | | ResNet50_DCcpu_FP32_mb96_gb96_acc1_1n1g | [390.69](https://oneflow-test.oss-cn-beijing.aliyuncs.com/OneFlowAutoTest/3080TI/ResNet50/oneflow-28/2a23745/1n1g/resnet50_ddp_realdata_DCcpu_FP32_mb96_gb96_acc1_1n1g_2a23745_20221102_083135574759563.log)...
### eager test - Machine: oneflow28 NVIDIA GeForce RTX 3080 Ti - Disk: ssd | Case | [check_xx@aa489bc](https://github.com/Oneflow-Inc/oneflow/tree/check_xx) | [master@a3841f5](https://github.com/Oneflow-Inc/oneflow/commit/a3841f5) | | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | | ResNet50_DCcpu_FP32_mb96_gb96_acc1_1n1g | [399.62](https://oneflow-test.oss-cn-beijing.aliyuncs.com/OneFlowAutoTest/3080TI/ResNet50/oneflow-28/aa489bc/1n1g/resnet50_ddp_realdata_DCcpu_FP32_mb96_gb96_acc1_1n1g_aa489bc_20221107_031407122908971.log)...
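> > python3 -m oneflow --doctor > > Hello, my oneflow version is 0.8.0 and my torch GPU works fine. But running it still produces no result. Could you paste the output? I would like to see the installed CUDA version, and also the Driver Version shown by nvidia-smi.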
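To make it easier to paste everything in one go, the two pieces of information requested above could be collected with a small script like the sketch below. It runs `python3 -m oneflow --doctor` (the command quoted above) and queries the driver version through `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The helper names `run_quiet` and `collect_report` are hypothetical; if a tool is missing, the script records that instead of failing.

```python
import shutil
import subprocess
import sys


def run_quiet(cmd):
    """Run cmd and return its combined stdout/stderr, or a note if it cannot run."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found"
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{cmd[0]}: {exc}"
    return (proc.stdout + proc.stderr).strip()


def collect_report():
    """Gather the driver version and the oneflow doctor output as strings."""
    return {
        "driver_version": run_quiet(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
        "oneflow_doctor": run_quiet([sys.executable, "-m", "oneflow", "--doctor"]),
    }


if __name__ == "__main__":
    # Print each section with a header so it can be pasted into the issue as-is.
    for name, text in collect_report().items():
        print(f"== {name} ==")
        print(text)
```

Running it with the same interpreter that has oneflow installed matters, since `sys.executable` is used for the doctor command.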