ros2doctor test_api.py hanging during nightly linux repeated jobs.
Bug report
Required Info:
- Operating System:
- Ubuntu Focal
- RHEL 8
- Installation type:
- source
- Version or commit hash:
- https://gist.github.com/nuclearsandwich/ee0b27ec978bdc3e40d0fb8cd52edf26#file-nightly_linux_repeated-2549-repos
- DDS implementation:
- cyclonedds (presumably)
- Client library (if applicable):
- N/A
The following jobs hung indefinitely (upwards of 24 hours in some cases) with the last output being ros2doctor's test_api.py tests.
```
Starting >>> ros2doctor
07:28:05 ============================= test session starts ==============================
07:28:05 platform linux -- Python 3.8.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
07:28:05 cachedir: /home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/ros2doctor/.pytest_cache
07:28:05 rootdir: /home/jenkins-agent/workspace/nightly_linux_repeated/ws/src/ros2/ros2cli, configfile: pytest.ini
07:28:05 plugins: launch-testing-ros-0.17.0, launch-testing-0.21.0, ament-xmllint-0.11.4, ament-copyright-0.11.4, ament-pep257-0.11.4, ament-flake8-0.11.4, ament-lint-0.11.4, mock-3.7.0, rerunfailures-10.2, cov-3.0.0, repeat-0.9.1, colcon-core-0.7.1
07:28:06 collecting ...
07:28:06 collected 15 items / 12 deselected / 3 selected
07:28:06
07:28:08 test/test_api.py .
```
- https://ci.ros2.org/job/nightly_linux-rhel_repeated/1022
- https://ci.ros2.org/job/nightly_linux-rhel_repeated/1023
- https://ci.ros2.org/job/nightly_linux_repeated/2549
It happened today as well: https://ci.ros2.org/job/nightly_linux-rhel_repeated/1024/
A wild guess is that the change from CycloneDDS -> Fast-DDS as the default RMW is causing this. If it is reproducible, I'd suggest running a CI build against a branch where https://github.com/ros2/rmw/pull/315 is reverted to see if that helps.
But that PR is from a few days ago, so I don't think it's related, although I don't see any differences between the repos files of the first failing case and the previous one.
I'll run a check to see if I can reproduce it using only ros2doctor on RHEL, which currently seems to be the most reliably failing case.
Current hypothesis is that ros2action actually fails first, leaves things in an "unrecoverable state", and then ros2doctor hangs. Running CI again, this time with ros2action as well; a sketch of a local check for this is below.
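To script that check locally, here is a minimal sketch, assuming a built colcon workspace with both packages and a sourced environment; the loop count, timeout value, and package selection are illustrative and not taken from the CI configuration. It runs the ros2action tests followed by the ros2doctor tests under a wall-clock timeout, so a hang gets reported instead of blocking for hours like the CI jobs.

```python
#!/usr/bin/env python3
# Hypothetical local reproduction loop: run the ros2action tests and then the
# ros2doctor tests repeatedly, treating any run that exceeds the timeout as the
# hang seen on CI. Assumes a built colcon workspace with both packages and a
# sourced environment; package names and timeouts are illustrative.
import subprocess
import sys

TIMEOUT_S = 600    # generous per-package limit; the CI jobs hung for >24 h
ITERATIONS = 20    # how many times to repeat the sequence

def run_tests(package: str) -> bool:
    """Run `colcon test` for one package; return False on timeout (hang)."""
    cmd = ['colcon', 'test', '--packages-select', package]
    try:
        subprocess.run(cmd, check=False, timeout=TIMEOUT_S)
        return True
    except subprocess.TimeoutExpired:
        print(f'{package}: tests exceeded {TIMEOUT_S}s, likely hung', file=sys.stderr)
        return False

for i in range(ITERATIONS):
    print(f'--- iteration {i + 1}/{ITERATIONS} ---')
    # Run ros2action first to exercise the "leftover state" hypothesis,
    # then ros2doctor, which is where the CI jobs stall.
    if not run_tests('ros2action') or not run_tests('ros2doctor'):
        sys.exit(1)
```

If the hang only shows up after the ros2action step, that would support the leftover-state hypothesis; if ros2doctor hangs on its own, the ordering is a red herring.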
Adding launch_testing_ros:
(No new clues from these green results.)
Another new instance I had to kill: https://ci.ros2.org/job/nightly_linux-aarch64_repeated/1858/
~~Happened today to me as well, trying to build ROS Foxy binaries for Debian Buster, hanging for over an hour on test_cli.py of ros2topic.~~
I just had to wait a little longer; I'm dealing with a somewhat slow build server.
Newer cases:
- https://ci.ros2.org/job/nightly_linux_repeated/2560/
- https://ci.ros2.org/job/nightly_linux-aarch64_repeated/1867/
I aborted the jobs below due to these hangs.
- https://ci.ros2.org/job/nightly_linux_repeated/2561/
- https://ci.ros2.org/job/nightly_linux-rhel_repeated/1035
Another instance I have just aborted: https://ci.ros2.org/job/nightly_linux-aarch64_repeated/1875/
https://ci.ros2.org/job/nightly_linux-rhel_repeated/1045/
I thought about closing this one, but then I saw this in ci_windows: https://ci.ros2.org/job/ci_windows/16645/
I am not sure it's exactly the same error, since it fails earlier, but it's probably related.
EDIT: I don't think it's directly related, as the attached job is testing a custom CI branch using Foxy.