
trunk / libtorch-linux-bionic-cuda11.6-py3.7-gcc7 / build is Flaky

Open zengk95 opened this issue 3 years ago • 2 comments

🐛 Describe the bug

Over the past day, this job has been failing intermittently, roughly 20 minutes into the build, with out-of-memory errors. We are disabling it in #82862 until this is fixed.

Example workflows: https://github.com/pytorch/pytorch/runs/7695112482?check_suite_focus=true https://github.com/pytorch/pytorch/runs/7695008056?check_suite_focus=true

2022-08-05T02:06:03.3088128Z Compiling  onerank_reduce.cu                   > /var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/onerank_reduce.o
2022-08-05T02:07:01.4785047Z virtual memory exhausted: Cannot allocate memory
2022-08-05T02:07:01.5035469Z Makefile:73: recipe for target '/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o' failed
2022-08-05T02:07:01.5035995Z make[5]: *** [/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o] Error 1
2022-08-05T02:07:01.5036437Z make[5]: Leaving directory '/var/lib/jenkins/workspace/third_party/nccl/nccl/src/collectives/device'
2022-08-05T02:07:01.5064220Z Makefile:51: recipe for target '/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/colldevice.a' failed
2022-08-05T02:07:01.5064718Z make[4]: *** [/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/colldevice.a] Error 2
2022-08-05T02:07:01.5065128Z make[4]: Leaving directory '/var/lib/jenkins/workspace/third_party/nccl/nccl/src'
2022-08-05T02:07:01.5065471Z Makefile:25: recipe for target 'src.build' failed
2022-08-05T02:07:01.5065681Z make[3]: *** [src.build] Error 2
2022-08-05T02:07:01.5066000Z make[3]: Leaving directory '/var/lib/jenkins/workspace/third_party/nccl/nccl'
2022-08-05T02:07:01.5066652Z CMakeFiles/nccl_external.dir/build.make:85: recipe for target 'nccl_external-prefix/src/nccl_external-stamp/nccl_external-build' failed
2022-08-05T02:07:01.5067116Z make[2]: *** [nccl_external-prefix/src/nccl_external-stamp/nccl_external-build] Error 2
2022-08-05T02:07:01.5196900Z make[2]: Leaving directory '/var/lib/jenkins/cpp-build/caffe2/build'
2022-08-05T02:07:01.5198235Z CMakeFiles/Makefile2:2043: recipe for target 'CMakeFiles/nccl_external.dir/all' failed
2022-08-05T02:07:01.5198664Z make[1]: *** [CMakeFiles/nccl_external.dir/all] Error 2
2022-08-05T02:07:01.5268130Z make[1]: *** Waiting for unfinished jobs....
2022-08-05T02:07:33.1381210Z make[2]: Leaving directory '/var/lib/jenkins/cpp-build/caffe2/build'
2022-08-05T02:07:33.2082428Z [ 86%] Built target torch_cpu
2022-08-05T02:07:33.2089935Z make[1]: Leaving directory '/var/lib/jenkins/cpp-build/caffe2/build'
2022-08-05T02:07:33.2092743Z Makefile:145: recipe for target 'all' failed
2022-08-05T02:07:33.2092986Z make: *** [all] Error 2
2022-08-05T02:07:33.3882379Z + sccache_epilogue
2022-08-05T02:07:33.3882944Z + echo '::group::Sccache Compilation Log'
<probably uninteresting folded group, click to show>
2022-08-05T02:07:33.6608891Z ##[error]Process completed with exit code 1.

Versions

master

cc @jbschlosser @seemethere @malfet @pytorch/pytorch-dev-infra @janeyx99

zengk95 • Aug 05 '22 16:08

https://github.com/pytorch/pytorch/pull/82775 is the culprit...

janeyx99 • Aug 09 '22 20:08

Hmm, the NCCL build is run with 6 parallel jobs (-j 6):

cd /var/lib/jenkins/workspace/third_party/nccl/nccl && env CCACHE_DISABLE=1 SCCACHE_DISABLE=1 make CXX=/opt/cache/bin/c++ CUDA_HOME=/usr/local/cuda NVCC=/usr/local/cuda/bin/nvcc NVCC_GENCODE=-gencode=arch=compute_52,code=sm_52 BUILDDIR=/var/lib/jenkins/cpp-build/caffe2/build/nccl VERBOSE=0 -j 6

malfet • Aug 09 '22 22:08
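For illustration only: the log shows "virtual memory exhausted" while NCCL is compiled with -j 6, so one possible mitigation is to cap the job count by available memory as well as by core count. The sketch below is not what the PyTorch build does. The helper name pick_jobs and the per-job memory budget are hypothetical, and the budget would have to be measured for the actual compiler workload.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PyTorch or NCCL build scripts):
# choose a make -j value limited by both core count and available memory.
#   $1 = number of cores
#   $2 = available memory in kB
#   $3 = ASSUMED memory budget per compile job in kB (not a measured value)
pick_jobs() {
  local cores=$1 mem_avail_kb=$2 per_job_kb=$3
  local by_mem=$(( mem_avail_kb / per_job_kb ))
  local jobs=$(( cores < by_mem ? cores : by_mem ))
  # Always allow at least one job so the build can make progress.
  if (( jobs < 1 )); then
    jobs=1
  fi
  echo "$jobs"
}

# Example usage on Linux (left commented out so the block has no side effects):
# mem_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
# make -j "$(pick_jobs "$(nproc)" "$mem_kb" 4000000)"
```

With a 4 GB per-job budget (an assumption), a 6-core runner reporting 8 GB available would be limited to 2 jobs, while one reporting 64 GB would still use all 6.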