Frequent discovery `DATABASE_ERROR` during WiFi brownouts and roaming
Is there an already existing issue for this?
- [X] I have searched the existing issues
Expected behavior
The context:
- A discovery server is running on a Linux host on wired Ethernet (a configuration sketch follows this list).
- Additional Linux hosts run DDS applications and are connected to the same network via wireless Ethernet (WiFi 6). These hosts are compute boards on autonomous mobile robots and are configured with static IP addresses.
- The mobile robots operate in a large area served by many wireless Access Points (APs). As they move through the environment, they routinely switch from one AP to another. The switch-over usually completes within a few hundred milliseconds, but can sometimes take a few seconds.
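For reference, a minimal sketch of how the discovery server participant on the wired host could be set up with the standard Fast DDS discovery-server API (the GUID prefix, address, port, and helper name below are placeholders, not our actual values):

```cpp
#include <sstream>

#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/common/Locator.h>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastrtps::rtps;

// Hypothetical helper: participant acting as the discovery SERVER on the wired host.
DomainParticipant* create_server_participant()
{
    DomainParticipantQos qos;
    qos.wire_protocol().builtin.discovery_config.discoveryProtocol =
            DiscoveryProtocol_t::SERVER;

    // GUID prefix the clients will use to identify this server (placeholder value).
    std::istringstream("44.53.00.5f.45.50.52.4f.53.49.4d.41") >> qos.wire_protocol().prefix;

    // Listening locator on the wired interface (placeholder address and port).
    Locator_t listening_locator;
    IPLocator::setIPv4(listening_locator, "192.168.10.1");
    listening_locator.port = 11811;
    qos.wire_protocol().builtin.metatrafficUnicastLocatorList.push_back(listening_locator);

    return DomainParticipantFactory::get_instance()->create_participant(0, qos);
}
```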
Expected behaviour:
- Usually we expect to see nothing at all: all endpoints should continue communicating with each other with only a brief interruption during the switch-over.
- If the switch-over takes too long, we expect the discovery service to 'drop' the participants and endpoints running on the WiFi hosts, but re-discover them once the switch-over to the new AP is complete. After this point, the publishers and subscribers should match again and data flow should resume.
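The 'drop' above corresponds to the participant lease expiring. As a rough sketch (the values and helper name are illustrative, not our production settings), how long a silent participant is tolerated is governed by the lease duration and announcement period:

```cpp
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;
using eprosima::fastrtps::Duration_t;

// Hypothetical helper: make the lease comfortably longer than a typical AP switch-over,
// so participants are only 'dropped' when the outage is genuinely long.
void configure_lease(DomainParticipantQos& qos)
{
    qos.wire_protocol().builtin.discovery_config.leaseDuration = Duration_t(10, 0);                   // 10 s lease
    qos.wire_protocol().builtin.discovery_config.leaseDuration_announcementperiod = Duration_t(2, 0); // announce every 2 s
}
```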
Current behavior
For the most part, the behaviour is as we expect (described above). However, occasionally, after the WiFi hosts rejoin the network, the discovery service is seen to throw error messages like these:
```
2023-05-19 07:20:08.146 [DISCOVERY_DATABASE Error] Reader 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.4.7 has no associated participant. Skipping -> Function create_readers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Reader 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.5.7 has no associated participant. Skipping -> Function create_readers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Writer 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.1.2 has no associated participant. Skipping -> Function create_writers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Writer 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.2.2 has no associated participant. Skipping -> Function create_writers_from_change_
```
Once we see these errors on the discovery service, we notice that discovery is no longer reliable. Certain publishers and subscribers may not match anymore and data flow may never recover.
Steps to reproduce
As this is an occasional behaviour, it is quite hard to reproduce. One way to reproduce it is to use multiple virtual machines (VMs) on a single host:
- Let each VM run DDS applications - perhaps one VM running a publisher and the other running the corresponding subscriber (a client configuration sketch follows this list).
- Let one of the VMs run a discovery server.
- Turn the network connectivity off and on repeatedly on one of the VMs.
- Eventually, after a few tries, `DATABASE_ERROR` messages are reported on the console running the discovery server.
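On each client VM, the DDS applications point their participants at the discovery server VM. A minimal sketch using the standard Fast DDS discovery-server client API (the GUID prefix, address, port, and helper name are placeholders):

```cpp
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/rtps/attributes/ServerAttributes.h>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastrtps::rtps;

// Hypothetical helper: create a participant that discovers through the server VM.
DomainParticipant* create_client_participant()
{
    DomainParticipantQos qos;
    qos.wire_protocol().builtin.discovery_config.discoveryProtocol =
            DiscoveryProtocol_t::CLIENT;

    // The server's GUID prefix and locator (placeholder values).
    RemoteServerAttributes remote_server;
    remote_server.ReadguidPrefix("44.53.00.5f.45.50.52.4f.53.49.4d.41");

    Locator_t server_locator;
    IPLocator::setIPv4(server_locator, "192.168.10.1");
    server_locator.port = 11811;
    remote_server.metatrafficUnicastLocatorList.push_back(server_locator);

    qos.wire_protocol().builtin.discovery_config.m_DiscoveryServers.push_back(remote_server);

    return DomainParticipantFactory::get_instance()->create_participant(0, qos);
}
```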
Fast DDS version/commit
Happens on master, and certainly on release 2.10.1.
Platform/Architecture
Ubuntu Focal 20.04 amd64, Ubuntu Focal 20.04 arm64
Transport layer
UDPv4
Additional context
We seem to have solved this by delayed reconciliation of readers and writers reported to have no associated participant: such readers and writers are pushed into a list, and when new participants are discovered, the association is attempted again. This appears to resolve the errors. I will add a pull request demonstrating the solution later.
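A minimal sketch of the idea (all type and function names here are hypothetical stand-ins, not the actual code in the pull request): endpoints whose participant is not yet known are parked in a pending list instead of being skipped, and the association is retried whenever a participant is discovered.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the discovery database types; the real fix lives in the PR.
using GuidPrefix = std::string;      // e.g. "01.0f.a5.ee.67.f6.82.da.01.00.00.00"
struct EndpointInfo { std::string guid; bool is_reader; };

class PendingEndpointReconciler
{
public:
    // Called when create_readers_from_change_/create_writers_from_change_ cannot find
    // the endpoint's participant: park the endpoint instead of dropping it.
    void defer(const GuidPrefix& participant_prefix, const EndpointInfo& endpoint)
    {
        pending_[participant_prefix].push_back(endpoint);
    }

    // Called when a participant is (re)discovered: retry the association for any
    // endpoints that were parked while the participant was unknown.
    template <typename AssociateFn>
    void on_participant_discovered(const GuidPrefix& participant_prefix, AssociateFn associate)
    {
        auto it = pending_.find(participant_prefix);
        if (it == pending_.end())
        {
            return;
        }
        for (const EndpointInfo& endpoint : it->second)
        {
            associate(endpoint);  // re-run the reader/writer-to-participant association
        }
        pending_.erase(it);
    }

private:
    std::map<GuidPrefix, std::vector<EndpointInfo>> pending_;
};
```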
XML configuration file
No response
Relevant log output
No response
Network traffic capture
No response
Associated PR showing code changes here: https://github.com/eProsima/Fast-DDS/pull/3545