ros2doctor test_api.py hanging during nightly linux repeated jobs.
Bug report
Required Info:
- Operating System:
- Ubuntu Focal
- RHEL 8
- Installation type:
- source
- Version or commit hash:
- https://gist.github.com/nuclearsandwich/ee0b27ec978bdc3e40d0fb8cd52edf26#file-nightly_linux_repeated-2549-repos
- DDS implementation:
- cyclonedds (presumably)
- Client library (if applicable):
- N/A
The following jobs hung indefinitely (upwards of 24 hours in some cases) with the last output being ros2doctor's test_api.py tests.
```
Starting >>> ros2doctor
07:28:05 ============================= test session starts ==============================
07:28:05 platform linux -- Python 3.8.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
07:28:05 cachedir: /home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/ros2doctor/.pytest_cache
07:28:05 rootdir: /home/jenkins-agent/workspace/nightly_linux_repeated/ws/src/ros2/ros2cli, configfile: pytest.ini
07:28:05 plugins: launch-testing-ros-0.17.0, launch-testing-0.21.0, ament-xmllint-0.11.4, ament-copyright-0.11.4, ament-pep257-0.11.4, ament-flake8-0.11.4, ament-lint-0.11.4, mock-3.7.0, rerunfailures-10.2, cov-3.0.0, repeat-0.9.1, colcon-core-0.7.1
07:28:06 collecting ...
07:28:06 collected 15 items / 12 deselected / 3 selected
07:28:06
07:28:08 test/test_api.py .
```
- https://ci.ros2.org/job/nightly_linux-rhel_repeated/1022
- https://ci.ros2.org/job/nightly_linux-rhel_repeated/1023
- https://ci.ros2.org/job/nightly_linux_repeated/2549
It happened today as well: https://ci.ros2.org/job/nightly_linux-rhel_repeated/1024/
A wild guess is that the change from CycloneDDS -> Fast-DDS as the default RMW is causing this. If it is reproducible, I'd suggest running a CI build against a branch where https://github.com/ros2/rmw/pull/315 is reverted to see if that helps.
But that PR is from a few days ago, so I don't think it's related, although I don't see any differences between the repos files of the first failing case and the previous one.
I'll run a check to see if I can reproduce it using only ros2doctor on RHEL, which currently seems to be the most reliably failing case.
Current hypothesis is that ros2action actually fails first, leaves things in an "unrecoverable state", and then ros2doctor hangs. Running CI again, this time with ros2action as well; a sketch of a local check for this is below.
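To script that check locally, here is a minimal sketch, assuming a built colcon workspace with both packages and a sourced environment; the loop count, timeout value, and package selection are illustrative and not taken from the CI configuration. It runs the ros2action tests followed by the ros2doctor tests under a wall-clock timeout, so a hang gets reported instead of blocking for hours like the CI jobs.

```python
#!/usr/bin/env python3
# Hypothetical local reproduction loop: run the ros2action tests and then the
# ros2doctor tests repeatedly, treating any run that exceeds the timeout as the
# hang seen on CI. Assumes a built colcon workspace with both packages and a
# sourced environment; package names and timeouts are illustrative.
import subprocess
import sys

TIMEOUT_S = 600    # generous per-package limit; the CI jobs hung for >24 h
ITERATIONS = 20    # how many times to repeat the sequence

def run_tests(package: str) -> bool:
    """Run `colcon test` for one package; return False on timeout (hang)."""
    cmd = ['colcon', 'test', '--packages-select', package]
    try:
        subprocess.run(cmd, check=False, timeout=TIMEOUT_S)
        return True
    except subprocess.TimeoutExpired:
        print(f'{package}: tests exceeded {TIMEOUT_S}s, likely hung', file=sys.stderr)
        return False

for i in range(ITERATIONS):
    print(f'--- iteration {i + 1}/{ITERATIONS} ---')
    # Run ros2action first to exercise the "leftover state" hypothesis,
    # then ros2doctor, which is where the CI jobs stall.
    if not run_tests('ros2action') or not run_tests('ros2doctor'):
        sys.exit(1)
```

If the hang only shows up after the ros2action step, that would support the leftover-state hypothesis; if ros2doctor hangs on its own, the ordering is a red herring.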
Adding launch_testing_ros:
(No new clues from these green results.)
Another new instance I had to kill: https://ci.ros2.org/job/nightly_linux-aarch64_repeated/1858/
~~Happened today to me as well, trying to build ROS Foxy binaries for Debian Buster, hanging for over an hour on test_cli.py of ros2topic.~~
I just had to wait a little longer; I'm dealing with a somewhat slow build server.
Newer cases:
- https://ci.ros2.org/job/nightly_linux_repeated/2560/
- https://ci.ros2.org/job/nightly_linux-aarch64_repeated/1867/
I aborted the jobs below due to these hangs.
- https://ci.ros2.org/job/nightly_linux_repeated/2561/
- https://ci.ros2.org/job/nightly_linux-rhel_repeated/1035
Another instance I have just aborted: https://ci.ros2.org/job/nightly_linux-aarch64_repeated/1875/
https://ci.ros2.org/job/nightly_linux-rhel_repeated/1045/
I thought about closing this one, but then I saw this in ci_windows: https://ci.ros2.org/job/ci_windows/16645/
I am not sure it's exactly the same error, since it fails earlier, but it's probably related.
EDIT: I don't think it's directly related, as the attached job is testing a custom CI branch using Foxy.