SharedFuture from async_send_request never becomes valid
Bug report
Required Info:
- Operating System:
- Ubuntu 22.04.1 LTS (osrf/ros:humble-desktop docker image)
- Installation type:
- from source
- Version or commit hash:
- 4fa3489cfd075affb34812878b034ef8a462e379
- DDS implementation:
- both fastrtps and cyclonedds
- Client library (if applicable):
- rclcpp
Steps to reproduce issue
Build and run the unit test below
#include <gtest/gtest.h>
#include <rclcpp/rclcpp.hpp>
#include <std_srvs/srv/set_bool.hpp>
#include <string>
using namespace std::chrono;
class TestRegularService : public ::testing::Test
{
public:
void SetUp() override
{
rclcpp::init(0, nullptr);
m_node_executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
// Create a node
auto opt = rclcpp::NodeOptions();
auto node = rclcpp::Node::make_shared("server_node", opt);
m_nodes.push_back(node);
m_node_executor->add_node(node->get_node_base_interface());
m_service = rclcpp::create_service<std_srvs::srv::SetBool>(
node->get_node_base_interface(),
node->get_node_services_interface(),
"example_service",
[node](const std::shared_ptr<std_srvs::srv::SetBool::Request> request,
const std::shared_ptr<std_srvs::srv::SetBool::Response> response) {
static unsigned int serial_num = 1;
(void)request;
RCLCPP_INFO(node->get_logger(), "Received service client request... Sending response: %d", serial_num);
response->success = true;
response->message = std::to_string(serial_num++);
},
rclcpp::ServicesQoS().get_rmw_qos_profile(),
nullptr);
node = rclcpp::Node::make_shared("client_node", opt);
m_nodes.push_back(node);
m_node_executor->add_node(node);
m_client = node->create_client<std_srvs::srv::SetBool>("example_service");
m_node_thread = std::thread(std::bind([this]()
{
this->m_node_executor->spin();
}));
}
~TestRegularService()
{
if (rclcpp::ok())
{
rclcpp::shutdown();
}
m_node_thread.join();
}
std::vector<std::shared_ptr<rclcpp::Node>> m_nodes;
std::shared_ptr<rclcpp::executors::SingleThreadedExecutor> m_node_executor;
rclcpp::ServiceBase::SharedPtr m_service;
rclcpp::Client<std_srvs::srv::SetBool>::SharedPtr m_client;
std::thread m_node_thread;
};
TEST_F(TestRegularService, test_regular_service)
{
auto request = std::make_shared<std_srvs::srv::SetBool::Request>();
request->data = true;
while (!m_client->wait_for_service(1s)) {
if (!rclcpp::ok()) {
RCLCPP_ERROR(m_nodes.front()->get_logger(), "Interrupted while waiting for the service. Exiting.");
return;
}
RCLCPP_INFO(m_nodes.front()->get_logger(), "service not available, waiting again...");
}
uint counter = 1;
do {
RCLCPP_INFO(m_nodes.front()->get_logger(), "Sending request: %d" , counter);
auto result = m_client->async_send_request(request);
RCLCPP_INFO(m_nodes.front()->get_logger(), "Waiting for response: %d" , counter);
auto answer = result.get();
ASSERT_TRUE(answer->success);
RCLCPP_INFO(m_nodes.front()->get_logger(), "Got response: %s", answer->message.c_str());
}while(++counter<1000000);
}
int main(int argc, char **argv)
{
::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}
Expected behavior
The unit test is expected to send an async request and a valid result will be retrieved with SharedFuture::get() later. The unit should repeat the process 1000000 times without issues.
Actual behavior
The unit test hangs randomly regardless of the underline DDS. Example snippet:
Start testing: Oct 31 20:35 UTC
----------------------------------------------------------
71/105 Testing: test-regular-ros2-service
71/105 Test: test-regular-ros2-service
Command: "/usr/bin/python3.10" "-u" "/ros2_humble/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml" "--package-name" "rclcpp" "--output-file" "/ros2_humble/build/rclcpp/ament_cmake_gtest/test-regular-ros2-service.txt" "--command" "/ros2_humble/build/rclcpp/test/rclcpp/test-regular-ros2-service" "--gtest_output=xml:/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml"
Directory: /ros2_humble/build/rclcpp/test/rclcpp
"test-regular-ros2-service" start time: Oct 31 20:35 UTC
Output:
----------------------------------------------------------
-- run_test.py: invoking following command in '/ros2_humble/build/rclcpp/test/rclcpp':
- /ros2_humble/build/rclcpp/test/rclcpp/test-regular-ros2-service --gtest_output=xml:/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestRegularService
[ RUN ] TestRegularService.test_regular_service
[INFO] [1667248559.329850983] [server_node]: Sending request: 1
[INFO] [1667248559.329986353] [server_node]: Waiting for response: 1
[INFO] [1667248559.330659463] [server_node]: Received service client request... Sending response: 1
[INFO] [1667248559.330995192] [server_node]: Got response: 1
[INFO] [1667248559.331019237] [server_node]: Sending request: 2
[INFO] [1667248559.331041372] [server_node]: Waiting for response: 2
[INFO] [1667248559.331233045] [server_node]: Received service client request... Sending response: 2
[INFO] [1667248559.331454604] [server_node]: Got response: 2
[INFO] [1667248559.331478869] [server_node]: Sending request: 3
[INFO] [1667248559.331500609] [server_node]: Waiting for response: 3
[INFO] [1667248559.331623666] [server_node]: Received service client request... Sending response: 3
[INFO] [1667248559.331833534] [server_node]: Got response: 3
[INFO] [1667248559.331858362] [server_node]: Sending request: 4
[INFO] [1667248559.331879729] [server_node]: Waiting for response: 4
[INFO] [1667248559.331984129] [server_node]: Received service client request... Sending response: 4
[INFO] [1667248559.332344850] [server_node]: Got response: 4
.......
.......
.......
[INFO] [1667248559.887715670] [server_node]: Sending request: 2704
[INFO] [1667248559.887733614] [server_node]: Waiting for response: 2704
[INFO] [1667248559.887767825] [server_node]: Received service client request... Sending response: 2704
[INFO] [1667248559.887899850] [server_node]: Got response: 2704
[INFO] [1667248559.887913152] [server_node]: Sending request: 2705
[INFO] [1667248559.887931926] [server_node]: Waiting for response: 2705
[INFO] [1667248559.887967492] [server_node]: Received service client request... Sending response: 2705
[INFO] [1667248559.888097595] [server_node]: Got response: 2705
[INFO] [1667248559.888119401] [server_node]: Sending request: 2706
[INFO] [1667248559.888141074] [server_node]: Waiting for response: 2706
<end of output>
Test time = 60.06 sec
----------------------------------------------------------
Test Failed.
"test-regular-ros2-service" end time: Oct 31 20:36 UTC
"test-regular-ros2-service" time elapsed: 00:01:00
----------------------------------------------------------
End testing: Oct 31 20:36 UTC
gtest = 60.06 sec*proc
Using the Intra-Process services and clients from https://github.com/ros2/rclcpp/pull/1847, it's possible to hot-fix the issue for single-process applications (i.e. enabling IPC on all clients/servers pairs). However it's unclear what's the cause of this issue.
It's hanged at result.get(), I have update the code with rclcpp::spin_until_future_complete to wait for incoming messages, and it works well, the code looks like bellow:
...
node = rclcpp::Node::make_shared("client_node", opt);
m_nodes.push_back(node);
//m_node_executor->add_node(node);
m_client = node->create_client<std_srvs::srv::SetBool>("example_service");
...
do {
RCLCPP_INFO(m_nodes.front()->get_logger(), "Sending request: %d" , counter);
auto result = m_client->async_send_request(request);
RCLCPP_INFO(m_nodes.front()->get_logger(), "Waiting for response: %d" , counter);
if (rclcpp::spin_until_future_complete(m_nodes.back(), result) ==
rclcpp::FutureReturnCode::SUCCESS)
{
auto answer = result.get();
RCLCPP_INFO(m_nodes.front()->get_logger(), "Got response: %s", answer->message.c_str());
}
...
The case makes sense to me.
The reason why future.get() is blocked is that promise does not set_value in handle_response.
If the request from client to service and the response from service to client is quick enough that done before putting the sequence_number into the pending_requests_, calling get_and_erase_pending_request in the handle_response can't get that there was the sequence_number requested before.
I think that https://github.com/ros2/rclcpp/blob/7c6785176a878cf16eb98995b732f405b9e776ed/rclcpp/include/rclcpp/client.hpp#L800 should put before calling rcl_send_request.
@iuhilnehc-ynos Thanks for your tip, I have created PR https://github.com/ros2/rclcpp/pull/2044 to fix this issue.
@llapx https://github.com/ros2/rclcpp/pull/2044 looks good to me, as @iuhilnehc-ynos mentioned, this could happen in theory.
just in case, can you make sure https://github.com/ros2/rclcpp/issues/2039#issuecomment-1303160571 is the case for this issue with using gdb or additional logging message? i think we want to know the root cause before fixing the problem.
@fujitatomoya
can you make sure https://github.com/ros2/rclcpp/issues/2039#issuecomment-1303160571 is the case for this issue
Confirmed, and it works well (the test case passed, no hang happend).
with using gdb or additional logging message?
Not yet, I will have a try.
@llapx, I cherry-picked your fix commit ece7a0eb1b1a656aa2241d71a06c3904f4bcb448 to humble but couldn't get the same result. The example still locks up for me. Please see attached a gdb backtrace. Thanks!
gdb.txt
@jefferyyjhsu
I retest it on humble branch, and it works well (both on latest rolling & humble branch):
...
100: [INFO] [1667963086.523220110] [server_node]: Received service client request... Sending response: 999999
100: [INFO] [1667963086.523367484] [server_node]: Got response: 999999
100: [ OK ] TestRegularService.test_regular_service (285302 ms)
100: [----------] 1 test from TestRegularService (285302 ms total)
100:
100: [----------] Global test environment tear-down
100: [==========] 1 test from 1 test suite ran. (285302 ms total)
100: [ PASSED ] 1 test.
100: -- run_test.py: return code 0
100: -- run_test.py: inject classname prefix into gtest result file '/work/rclcpp_issues_2039/build/rclcpp/test_results/rclcpp/test_regular_service.gtest.xml'
100: -- run_test.py: verify result file '/work/rclcpp_issues_2039/build/rclcpp/test_results/rclcpp/test_regular_service.gtest.xml'
1/1 Test #100: test_regular_service ............. Passed 285.49 sec
The following tests passed:
test_regular_service
100% tests passed, 0 tests failed out of 1
Label Time Summary:
gtest = 285.49 sec*proc (1 test)
Total Test time (real) = 286.08 sec
Finished <<< rclcpp [8min 4s]
Summary: 1 package finished [8min 4s]
My test env:
ros:humble on docker rclcpp humble branch with latest commit ae8b033ae01601475104bf9e4c2350b98c9560c4
Hi @llapx, I did some more tests today. It turns out that the test now passes when using cyclonedds. However, it still fails with fastrtps. Do you happen to be using cyclonedds as well?
@jefferyyjhsu
I used default env of latest ros:humble docker (with some extra packages installed for compiling rclcpp):
root@csc:/work/rclcpp_issues_2039# dpkg -l | grep fastrtps
ii ros-humble-fastrtps 2.6.1-1jammy.20220620.173333 amd64 *eprosima Fast DDS* (formerly Fast RTPS) is a C++ implementation of the DDS (Data Distribution Service) standard of the OMG (Object Management Group).
ii ros-humble-fastrtps-cmake-module 2.2.0-2jammy.20220520.001034 amd64 Provide CMake module to find eProsima FastRTPS.
ii ros-humble-rmw-fastrtps-cpp 6.2.1-2jammy.20220620.175641 amd64 Implement the ROS middleware interface using eProsima FastRTPS static code generation in C++.
ii ros-humble-rmw-fastrtps-shared-cpp 6.2.1-2jammy.20220620.175144 amd64 Code shared on static and dynamic type support of rmw_fastrtps_cpp.
ii ros-humble-rosidl-typesupport-fastrtps-c 2.2.0-2jammy.20220520.005614 amd64 Generate the C interfaces for eProsima FastRTPS.
ii ros-humble-rosidl-typesupport-fastrtps-cpp 2.2.0-2jammy.20220520.005259 amd64 Generate the C++ interfaces for eProsima FastRTPS.
root@csc:/work/rclcpp_issues_2039# dpkg -l | grep cyclone
@jefferyyjhsu, thanks for the confirmation to let me know there is still an issue for rmw_fastrtps. I did the test yesterday, and I can reproduce the issue using rmw_fastrtps in whatever the humble or rolling (I didn't use docker to test).
From the https://github.com/ros2/rclcpp/issues/2039#issuecomment-1309400835, I think that the issue for rmw_cyclonedds_cpp is fixed by https://github.com/ros2/rclcpp/pull/2044. (of course, it's a rclcpp's bug)
I am afraid that another bug exists in the Fast-DDS, but it needs to confirm by eProsima.
I'd like to share my understanding below,
https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L159 doesn't hold WaitSetImpl::mutex_, but called in https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/ConditionNotifier.cpp#L58.
https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L131 might lose a condition notification when condition_variable is waiting for the fill_active_conditions.
A case such as https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L123 with ret_val(false), and then there is a notification triggered at https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/StatusConditionImpl.cpp#L90.
A possible solution,
diff --git a/src/cpp/fastdds/core/condition/WaitSetImpl.cpp b/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
index 73390e35d..ecd74b735 100644
--- a/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
+++ b/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
@@ -68,7 +68,7 @@ ReturnCode_t WaitSetImpl::attach_condition(
// Should wake_up when adding a new triggered condition
if (is_waiting_ && condition.get_trigger_value())
{
- wake_up();
+ cond_.notify_one();
}
}
}
@@ -156,6 +156,7 @@ ReturnCode_t WaitSetImpl::get_conditions(
void WaitSetImpl::wake_up()
{
+ std::lock_guard<std::mutex> guard(mutex_);
cond_.notify_one();
}
could you help double-check if the issue still happened after applying the above patch for Fast-DDS?
@MiguelCompany @fujitatomoya @llapx Do you have any suggestions?
@iuhilnehc-ynos I think you may be right. The fix seems an easy one, but adding a regression test might be difficult. I'll try to think of something ...
Thanks @iuhilnehc-ynos, I tested the proposed fix and the example now passes reliably.
@iuhilnehc-ynos https://github.com/ros2/rclcpp/issues/2039#issuecomment-1310024990 is deep, but i think the proposed change makes sense and actually fixes the problem reported by this issue.
@iuhilnehc-ynos can you make PR for Fast-DDS?
@fujitatomoya
Based on https://github.com/ros2/rclcpp/issues/2039#issuecomment-1310069608, I think @MiguelCompany will create a PR with a regression test for it by himself.
@iuhilnehc-ynos Please open the PR, and we'll add the regression test on it
Just a note: this seems to be the same issue reported here https://github.com/osrf/docker_images/issues/628 Seems to be currently fixed if using osrf/ros2:nightly with teh default DDS but not on the rolling docker image yet.
I think we can close this since,
- https://github.com/eProsima/Fast-DDS/pull/3087
- https://github.com/eProsima/Fast-DDS/pull/3195
are merged now. please reopen the issue or another one if anything missing.