Frequent discovery `DATABASE_ERROR` during WiFi brownouts and roaming
Is there an already existing issue for this?
- [X] I have searched the existing issues
Expected behavior
The context:
- A discovery server is running on a Linux host on wired Ethernet (a configuration sketch follows this list).
- Additional Linux hosts run DDS applications and are connected to the same network via wireless Ethernet (WiFi 6). These hosts are compute boards on autonomous mobile robots and are configured with static IP addresses.
- The mobile robots operate in a large area served by many wireless Access Points (APs). As they move through the environment, they routinely switch from one AP to another. The switch-over usually completes within a few hundred milliseconds, but can sometimes take a few seconds.
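For reference, a minimal sketch of how the discovery server participant on the wired host could be set up with the standard Fast DDS discovery-server API (the GUID prefix, address, port, and helper name below are placeholders, not our actual values):

```cpp
#include <sstream>

#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/common/Locator.h>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastrtps::rtps;

// Hypothetical helper: participant acting as the discovery SERVER on the wired host.
DomainParticipant* create_server_participant()
{
    DomainParticipantQos qos;
    qos.wire_protocol().builtin.discovery_config.discoveryProtocol =
            DiscoveryProtocol_t::SERVER;

    // GUID prefix the clients will use to identify this server (placeholder value).
    std::istringstream("44.53.00.5f.45.50.52.4f.53.49.4d.41") >> qos.wire_protocol().prefix;

    // Listening locator on the wired interface (placeholder address and port).
    Locator_t listening_locator;
    IPLocator::setIPv4(listening_locator, "192.168.10.1");
    listening_locator.port = 11811;
    qos.wire_protocol().builtin.metatrafficUnicastLocatorList.push_back(listening_locator);

    return DomainParticipantFactory::get_instance()->create_participant(0, qos);
}
```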
Expected behaviour:
- Usually we expect to see nothing at all: all endpoints should continue communicating with each other with only a brief interruption during the switch-over.
- If the switch-over takes too long, we expect the discovery service to 'drop' the participants and endpoints running on the WiFi hosts, but re-discover them once the switch-over to the new AP is complete. After this point, the publishers and subscribers should match again and data flow should resume.
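The 'drop' above corresponds to the participant lease expiring. As a rough sketch (the values and helper name are illustrative, not our production settings), how long a silent participant is tolerated is governed by the lease duration and announcement period:

```cpp
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;
using eprosima::fastrtps::Duration_t;

// Hypothetical helper: make the lease comfortably longer than a typical AP switch-over,
// so participants are only 'dropped' when the outage is genuinely long.
void configure_lease(DomainParticipantQos& qos)
{
    qos.wire_protocol().builtin.discovery_config.leaseDuration = Duration_t(10, 0);                   // 10 s lease
    qos.wire_protocol().builtin.discovery_config.leaseDuration_announcementperiod = Duration_t(2, 0); // announce every 2 s
}
```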
Current behavior
For the most part, the behaviour is as we expect (described above). However, occasionally, after the WiFi hosts rejoin the network, the discovery service is seen to throw error messages like these:
```
2023-05-19 07:20:08.146 [DISCOVERY_DATABASE Error] Reader 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.4.7 has no associated participant. Skipping -> Function create_readers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Reader 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.5.7 has no associated participant. Skipping -> Function create_readers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Writer 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.1.2 has no associated participant. Skipping -> Function create_writers_from_change_
2023-05-19 07:20:08.147 [DISCOVERY_DATABASE Error] Writer 01.0f.a5.ee.67.f6.82.da.01.00.00.00|0.0.2.2 has no associated participant. Skipping -> Function create_writers_from_change_
```
Once we see these errors on the discovery service, we notice that discovery is no longer reliable. Certain publishers and subscribers may not match anymore and data flow may never recover.
Steps to reproduce
As this is an occasional behaviour, it is quite hard to reproduce. One way to reproduce it is to use multiple virtual machines (VMs) on a single host:
- Let each VM run DDS applications - perhaps one VM running a publisher and the other running the corresponding subscriber (a client configuration sketch follows this list).
- Let one of the VMs run a discovery server.
- Turn the network connectivity off and on repeatedly on one of the VMs.
- Eventually, after a few tries, `DATABASE_ERROR` messages are reported on the console running the discovery server.
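On each client VM, the DDS applications point their participants at the discovery server VM. A minimal sketch using the standard Fast DDS discovery-server client API (the GUID prefix, address, port, and helper name are placeholders):

```cpp
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/rtps/attributes/ServerAttributes.h>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastrtps::rtps;

// Hypothetical helper: create a participant that discovers through the server VM.
DomainParticipant* create_client_participant()
{
    DomainParticipantQos qos;
    qos.wire_protocol().builtin.discovery_config.discoveryProtocol =
            DiscoveryProtocol_t::CLIENT;

    // The server's GUID prefix and locator (placeholder values).
    RemoteServerAttributes remote_server;
    remote_server.ReadguidPrefix("44.53.00.5f.45.50.52.4f.53.49.4d.41");

    Locator_t server_locator;
    IPLocator::setIPv4(server_locator, "192.168.10.1");
    server_locator.port = 11811;
    remote_server.metatrafficUnicastLocatorList.push_back(server_locator);

    qos.wire_protocol().builtin.discovery_config.m_DiscoveryServers.push_back(remote_server);

    return DomainParticipantFactory::get_instance()->create_participant(0, qos);
}
```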
Fast DDS version/commit
Happens on master, and certainly on release 2.10.1.
Platform/Architecture
Ubuntu Focal 20.04 amd64, Ubuntu Focal 20.04 arm64
Transport layer
UDPv4
Additional context
We seem to have solved this by delayed reconciliation of readers and writers reported to have no associated participant: such readers and writers are pushed into a list, and when new participants are discovered, the association is attempted again. This appears to resolve the errors. I will add a pull request demonstrating the solution later.
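A minimal sketch of the idea (all type and function names here are hypothetical stand-ins, not the actual code in the pull request): endpoints whose participant is not yet known are parked in a pending list instead of being skipped, and the association is retried whenever a participant is discovered.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the discovery database types; the real fix lives in the PR.
using GuidPrefix = std::string;      // e.g. "01.0f.a5.ee.67.f6.82.da.01.00.00.00"
struct EndpointInfo { std::string guid; bool is_reader; };

class PendingEndpointReconciler
{
public:
    // Called when create_readers_from_change_/create_writers_from_change_ cannot find
    // the endpoint's participant: park the endpoint instead of dropping it.
    void defer(const GuidPrefix& participant_prefix, const EndpointInfo& endpoint)
    {
        pending_[participant_prefix].push_back(endpoint);
    }

    // Called when a participant is (re)discovered: retry the association for any
    // endpoints that were parked while the participant was unknown.
    template <typename AssociateFn>
    void on_participant_discovered(const GuidPrefix& participant_prefix, AssociateFn associate)
    {
        auto it = pending_.find(participant_prefix);
        if (it == pending_.end())
        {
            return;
        }
        for (const EndpointInfo& endpoint : it->second)
        {
            associate(endpoint);  // re-run the reader/writer-to-participant association
        }
        pending_.erase(it);
    }

private:
    std::map<GuidPrefix, std::vector<EndpointInfo>> pending_;
};
```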
XML configuration file
No response
Relevant log output
No response
Network traffic capture
No response
Associated PR showing code changes here: https://github.com/eProsima/Fast-DDS/pull/3545