`run_pipeline.py` hangs on Open3D's `iou_bev` invocation in PyTorch
The `iou_bev` call inside `box_collision_test()` in `datasets/utils/operations.py` hangs with no CPU or memory usage, blocking all forward progress.
This can be reproduced using the following Dockerfile, which codifies the instructions provided in the README, and then the following command run inside the container with this repo mounted at /Open3D-ML:
python scripts/run_pipeline.py torch -c ml3d/configs/pointpillars_kitti.yml --dataset_path /data/kitti_second/kitti/ --pipeline ObjectDetection
which produces
Using external Open3D-ML in /Open3D-ML
regular arguments
batch_size: null
cfg_dataset: null
cfg_file: ml3d/configs/pointpillars_kitti.yml
cfg_model: null
cfg_pipeline: null
ckpt_path: null
dataset: null
dataset_path: /data/kitti_second/kitti/
device: gpu
framework: torch
main_log_dir: null
max_epochs: null
mode: null
model: null
pipeline: ObjectDetection
split: train
extra arguments
{}
INFO - 2021-08-05 04:50:41,756 - object_detection - DEVICE : cuda
INFO - 2021-08-05 04:50:41,756 - object_detection - Logging in file : ./logs/PointPillars_KITTI_torch/log_train_2021-08-05_04:50:41.txt
INFO - 2021-08-05 04:50:41,756 - kitti - Found 3712 pointclouds for training
INFO - 2021-08-05 04:50:41,757 - object_detection - Initializing from scratch.
INFO - 2021-08-05 04:50:41,758 - object_detection - Writing summary in train_log/00068_PointPillars_KITTI_torch.
INFO - 2021-08-05 04:50:41,759 - object_detection - Started training
INFO - 2021-08-05 04:50:41,759 - object_detection - === EPOCH 0/200 ===
INFO - 2021-08-05 04:50:41,759 - object_detection - after model.train()
training: 0%| | 0/834 [00:00<?, ?it/s]
and then makes no forward progress. The installed open3d version is 0.13.0, as per
root@3c175ef83193:/Open3D-ML# python -c "import open3d as o3d; print(o3d.__version__)"
Using external Open3D-ML in /Open3D-ML
0.13.0
This was run on a machine with Driver Version 460.80 and CUDA Version 11.2, without a base CUDA install (i.e. no nvcc; I manage CUDA installs via conda).
Failed attempts to fix the issue:
- `conda` and `pip` versions of `open3d`
- isl-org-compiled and `conda` versions of PyTorch 1.7.1
- Base images with CUDA 10.1 (`nvidia/cuda:10.1-cudnn8-devel-ubuntu18.04`) and 11.1 (`nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04`)
- Default instructions inside of a standard `conda` environment (not inside a docker container)
- Python 3.6 and 3.8
A trivial standalone program that invokes this IoU function runs correctly inside the given docker container. The following script
import numpy as np
from open3d.ml.contrib import iou_3d_cpu
a1 = np.array([[0.,0.,0.,1.,1.,1.,0.]]).astype(np.float32)
a2 = np.array([[0.,0.,0.,1.,1.,1.,0.]]).astype(np.float32)
a3 = np.array([[3.,0.,0.,1.,1.,1.,0.]]).astype(np.float32)
print(iou_3d_cpu(a1, a2))
print(iou_3d_cpu(a1, a3))
correctly prints
Using external Open3D-ML in /Open3D-ML
[[1.]]
[[0.]]
Additionally, I tried running a trivial training pipeline from another folder to rule out a possible namespace issue, but this did not help. I have also noticed that when OPEN3D_ML_ROOT is not set, the Open3D-ML install bundled with open3d is used instead, but the hang persists there as well.
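(For reference, the external checkout is selected by pointing the environment variable at this repo before running, e.g.
export OPEN3D_ML_ROOT=/Open3D-ML
which, as far as I can tell, is what produces the "Using external Open3D-ML in /Open3D-ML" lines in the logs above.)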
I was able to reproduce this error on a friend's system using my Dockerfile, making me think this is not an issue with my base machine.
I tried compiling Open3D from source inside the docker container. It compiled if I skipped the ML library, but then the training pipeline failed on a missing import; with the ML library enabled, none of the targets built successfully.
I am going to reimplement the iou_bev function, as that seems to be the only function blocking training.
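A rough sketch of the kind of replacement I have in mind, in pure Python using shapely for the rotated-rectangle intersection (the function names here are mine, and I am assuming BEV boxes as rows of [x, y, w, l, yaw]; the exact column convention used by box_collision_test may differ):

import numpy as np
from shapely.geometry import Polygon

def bev_polygon(box):
    # Rotated rectangle footprint of one BEV box [x, y, w, l, yaw].
    x, y, w, l, yaw = box
    corners = np.array([[-w / 2, -l / 2], [w / 2, -l / 2],
                        [w / 2, l / 2], [-w / 2, l / 2]])
    rot = np.array([[np.cos(yaw), -np.sin(yaw)],
                    [np.sin(yaw),  np.cos(yaw)]])
    return Polygon(corners @ rot.T + np.array([x, y]))

def iou_bev_py(boxes_a, boxes_b):
    # Pairwise BEV IoU between (N, 5) and (M, 5) box arrays.
    polys_a = [bev_polygon(b) for b in boxes_a]
    polys_b = [bev_polygon(b) for b in boxes_b]
    ious = np.zeros((len(polys_a), len(polys_b)), dtype=np.float32)
    for i, pa in enumerate(polys_a):
        for j, pb in enumerate(polys_b):
            inter = pa.intersection(pb).area
            union = pa.area + pb.area - inter
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious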
As per @sanskar107's suggestion, adding --pipeline.num_workers 0 serves as a viable workaround.
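Concretely, that is the original invocation with the flag appended:
python scripts/run_pipeline.py torch -c ml3d/configs/pointpillars_kitti.yml --dataset_path /data/kitti_second/kitti/ --pipeline ObjectDetection --pipeline.num_workers 0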
As suggested, the root cause seems to be calling any custom op from the PyTorch DataLoader workers, not Open3D's iou_bev specifically: my own iou_bev, a Python wrapper around a C++ implementation, causes the same hang.
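The failing pattern boils down to something like the following (a hypothetical minimal reproducer, not verified in this exact form, using the iou_3d_cpu op from the script above as a stand-in for any native op):

import numpy as np
from torch.utils.data import Dataset, DataLoader
from open3d.ml.contrib import iou_3d_cpu

class BoxDataset(Dataset):
    # A dataset whose __getitem__ calls a custom/native op, as the
    # augmentation code path through box_collision_test() does.
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        box = np.array([[0., 0., 0., 1., 1., 1., 0.]], dtype=np.float32)
        _ = iou_3d_cpu(box, box)  # native op called from a worker process
        return idx

loader = DataLoader(BoxDataset(), batch_size=2, num_workers=4)
for _ in loader:  # with num_workers > 0, this is where progress stalls
    pass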
@kylevedder Thanks for reporting this. The main problem seems to be related to this PyTorch issue: https://github.com/pytorch/pytorch/issues/46409. Could you try the workarounds mentioned there and create a pull request if anything works?
Setting `OMP_NUM_THREADS=1` and running with the default number of workers fixes the issue and gives significantly higher throughput than setting `--pipeline.num_workers 0`. I will look into the linked thread further later.
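For reference, that is just the original command with the environment variable set:
OMP_NUM_THREADS=1 python scripts/run_pipeline.py torch -c ml3d/configs/pointpillars_kitti.yml --dataset_path /data/kitti_second/kitti/ --pipeline ObjectDetection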