js-libp2p icon indicating copy to clipboard operation
js-libp2p copied to clipboard

Dial queue timeout for multiple multiaddresses incorrectly used for each connection separately

Open akim-bow opened this issue 2 years ago • 10 comments

Version: [email protected]

Severity:

Medium

Description:

If remote peer proposes multiple multiaddresses, e.g. [/dns/nox-1, /ip4/10.50.10.10, /ip4/127.0.0.1] (only the last one is reachable by default), libp2p will try to connect to them one by one. As you can see in the source code, the shared signal is created and used in every attempt to dial multiaddresses. Hence, the first address could take up all the time until the signal aborts and other addresses would be dropped without an attempt to connect to them.

I'm proposing to use 2 separate signals - first one for batch of multiaddresses, the second one for each multiaddress separately. It will solve the case when a peer provides multiple adresses and only some of them are actually valid. If the description isn't clear enough i can try to build simple reproduction example.

akim-bow avatar Jan 19 '24 18:01 akim-bow

@achingbrain can you please look into it?

akim-bow avatar Feb 08 '24 11:02 akim-bow

@achingbrain could you take a look at this ? Most auto dials are failing because of it

zeroxbt avatar Mar 05 '24 01:03 zeroxbt

@akim-bow what kind of addresses are you having problems with?

The example you gave was [/dns/nox-1, /ip4/10.50.10.10, /ip4/127.0.0.1] - DNS addresses take time to resolve, so that can be bad, fair enough, but the other two are private local addresses so assuming TCP, they should be fast to fail?

If you know certain addresses are going to be unreliable or slow, you can also pass an addressSorter as part of the connectionManager config to ensure they are dialled last, or a connectionGater to ensure they are not dialled at all.

achingbrain avatar Mar 06 '24 14:03 achingbrain

@achingbrain I'm having the same issue when running nodes in local environment.

When dialing a peer, these are the addresses found in the peer store:

[/ip4/10.2.0.2/tcp/9106, /ip4/127.0.0.1/tcp/9106, /ip4/172.20.5.94/tcp/9106]

The first address is not reachable, while the other two are. The dialer fails after dialTimeout milliseconds because of the first address, without ever attempting to dial the other two. All auto dials are also failing for the same reason.

zeroxbt avatar Mar 06 '24 22:03 zeroxbt

My concern with the proposed solution is that having separate timeouts for individual addresses could lead to very long dial times.

I wonder if we could be smarter with the dialling, for example if the IP address of the target multiaddr is a class A private network address, deprioritise or entirely skip dialling it unless the current node also has a class A private network address in the same logical network?

achingbrain avatar Mar 07 '24 15:03 achingbrain

Although sorting and filtering addresses is a good option, as a user I would still expect the dial function to attempt dialing all provided addresses. Why not give users the option to choose a maxParallelDialsPerPeer parameter with a default value of 1 ?

zeroxbt avatar Mar 07 '24 15:03 zeroxbt