the subscriber cannot receive data normally [12226]
Expected Behavior
Current Behavior
In one case, the publisher is sending data and the subscriber is receiving it. If the subscriber is closed abruptly and the program is started again, the subscriber can no longer receive data normally.
Steps to Reproduce
System information
- Fast-RTPS version:
- OS:win7-x64
- Network interfaces:
- ROS2:
Additional context
Additional resources
- Wireshark capture
- XML profiles file
In one case, the publisher is sending data and the subscriber is receiving it. If the subscriber is closed abruptly and the program is started again, the subscriber can no longer receive data normally.
Which case? How to reproduce?
With so little information there's not much we can do to help.
- Fast-RTPS version: 2.3.3
- OS: win7-w32 / win10-win32
- demo.zip
When I abruptly stopped the subscriber during communication and then started it again, I found that communication no longer worked properly.
Sometimes the subscription matches but no data is exchanged; sometimes the subscription doesn't
When I publish multiple topics from one program, nothing goes wrong the first time I start the subscriber program. But when I close the subscriber program and start it again, some topic subscriptions may fail. After analysis, I found that if the subscriber program is closed abruptly while the publisher and the subscriber are communicating, the subscriber fails to subscribe the next time it is started.
- Why does the above situation exist? What caused this?
- How can I successfully subscribe again after the program exits abnormally?
IDL:
struct SimMessage
{
    unsigned long id;
    unsigned long long time_ms;
    string dest;
    string src;
    string type;
    string subtype;
    sequence<octet, 24000> data;
};
I found that the following code significantly reduces the probability of this failure, but the failure still occurs after many attempts. I don't know why; can you help me?
pqos.wire_protocol().builtin.readerPayloadSize = 1024 * 1024 * 1;
pqos.wire_protocol().builtin.readerHistoryMemoryPolicy = DYNAMIC_REUSABLE_MEMORY_MODE;
pqos.wire_protocol().builtin.writerPayloadSize = 1024 * 5;
pqos.wire_protocol().builtin.writerHistoryMemoryPolicy = DYNAMIC_REUSABLE_MEMORY_MODE;
@libfsw Thank you for the additional information and the demo code.
If modifying the configuration of the builtin protocols is making things work better, it should then be related to the discovery / matching of either participants or endpoints.
I will take a look and see if I can reproduce.
Hey @libfsw and @MiguelCompany, the same problem also occurs frequently on my system (Win10 64-bit, FastDDS 2.3.0 and 2.3.3, all nodes on the same machine). I have a bunch of publishers and subscribers running and want to regularly inspect the topics with my ImageViewer-Node. If I start all nodes at a similar time, the ImageViewer-Node is able to find all participants and can receive messages. After some random time and several restarts of the Viewer, it still finds the other participants, sometimes even their topics, but does not receive any messages.
I have used the DDS/HelloWorldExample project for testing, made it publish infinitely and added a custom DomainParticipantListener to be able to see what has been found by the Participant.
I did the following tests:
- Making the subscriber crash after n messages by printing the value of a nullptr (directly after taking the sample from the reader), then restarting the subscriber. Both nodes find each other, the subscriber finds the publisher and matches, but the subscriber node does not receive anything. The only solution is to also restart the publisher.
- Continuously starting the subscriber and closing it after the first messages have been received. It takes 10-150 attempts until I get the same effect as crashing the subscriber; sometimes it still works after 150 attempts, which makes it difficult to reproduce.
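A minimal sketch of how the repeated start/stop test above could be automated, so that the attempt at which reception first fails is recorded instead of counted by hand. `first_failing_attempt` and its `attempt` callback are hypothetical names, not part of Fast DDS; the callback is assumed to start the subscriber, wait for the first samples, close it, and report whether any data arrived.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper (not part of Fast DDS).
// Calls `attempt(i)` for i = 1..max_attempts and returns the 1-based index
// of the first attempt that returned false (no data received).
// Returns 0 if every attempt succeeded.
int first_failing_attempt(
        int max_attempts,
        const std::function<bool(int)>& attempt)
{
    assert(attempt && "an attempt callback is required");
    for (int i = 1; i <= max_attempts; ++i)
    {
        if (!attempt(i))
        {
            return i;  // reception failed on this attempt
        }
    }
    return 0;  // no failure observed within max_attempts
}
```

For example, `first_failing_attempt(150, run_subscriber_once)` would cover the 10-150 range mentioned above, where `run_subscriber_once` is a placeholder for whatever launches and checks the subscriber on your system.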
I have experimented with the discoveryconfig settings, but that did not fix the issue; it only reduced how often it occurs.
I also see the same issue with 2.5.0: if the subscriber crashes or hits an assert and is then restarted, it can't receive any topic events anymore. Has anyone fixed it?
My current solution is to disable shared memory and only allow the TCPv4 and UDPv4 transport descriptors:
fastdds::dds::DomainParticipantQos pqos;
//Set your other DomainParticipant Qos settings here...
pqos.transport().use_builtin_transports = false;
auto tcp_descriptor = std::make_shared<fastdds::rtps::TCPv4TransportDescriptor>();
pqos.transport().user_transports.push_back(tcp_descriptor);
auto udp_transport = std::make_shared<fastdds::rtps::UDPv4TransportDescriptor>();
pqos.transport().user_transports.push_back(udp_transport);
participant = fastdds::dds::DomainParticipantFactory::get_instance()->create_participant(0, pqos, this, fastdds::dds::StatusMask::none());
With this solution, all my participants always reconnect and nothing gets corrupted. I did not see any performance degradation after disabling shared memory, so I am happy with this setting.
Same issue here, with the TOT build and with version 2.6.0. I have to use shared memory; I tested with the suggestion above (many thanks!), and even shared memory works, with some dropped messages/buffers.
Recently, several improvements related to SHM reconnection have been merged (#3639, #3640, and #3642). @libfsw, @PFrieling and @keith4ever could you please check if the issue reported has been fixed/mitigated?
According to our CONTRIBUTING.md guidelines, I am closing this issue due to inactivity. Please, feel free to reopen it if necessary.