rust-libp2p icon indicating copy to clipboard operation
rust-libp2p copied to clipboard

fix(kad): correctly handle `NoKnownPeers` error when bootstrap

Open stormshield-frb opened this issue 1 year ago • 2 comments

Description

After testing master, we encountered a bug due to #4838 when doing automatic or periodic bootstrap if the node has no known peers.

Since it failed immediately, I though there was no need to call the bootstrap_status.on_started method. But no doing so never reset the periodic timer inside bootstrap_status resulting in getting stuck to try to bootstrap every time poll is called on kad::Behaviour.

Notes & open questions

N/A

Change checklist

  • [x] I have performed a self-review of my own code
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] A changelog entry has been made in the appropriate crates

stormshield-frb avatar Apr 30 '24 13:04 stormshield-frb

This looks a bit hacky, wouldn't it be better to modify the bootstrap Status instead (e.g poll_next_bootstrap)?

guillaumemichel avatar Apr 30 '24 14:04 guillaumemichel

This looks a bit hacky, wouldn't it be better to modify the bootstrap Status instead (e.g poll_next_bootstrap)?

I'm not sure to understand what you mean. on_started and on_finished are intended for that purpose.

Even if I would update the Status directly, we would not be able to remove on_started completely since the end user could still manually trigger a bootstrap, and we would not be able to remove on_finish at all since there is currently no way to detect a bootstrap has finished outside exploring query_finished or query_timeout. And since a bootstrap can also fail immediately, we have to handle that there.

I agree that it would feel better to have a passive way to learn that a bootstrap did start or finish but I don't see how to implement that in a reasonably simple manner.

The reason we need to know if a bootstrap as started or finished is because we don't want to cascade bootstrap requests. When a bootstrap is triggered (no matter if it was automatic, periodic or manual), we reset the automatic and periodic timer to their initial value.

stormshield-frb avatar May 02 '24 07:05 stormshield-frb

Not checking whether the routing table is empty would take us back to before this commit. IMO it is good that this check was introduced, allowing the bootstrap to fail fast.

If the check isn't performed, a new query for self is created, and the first time next() is called, the query state is immediately set to Finished. IMO it would be better to have an error message saying that the bootstrap failed because the routing table is empty, rather than looking for an empty OutboundQueryProgressed.

guillaumemichel avatar Jun 11 '24 12:06 guillaumemichel

Not checking whether the routing table is empty would take us back to before this commit. IMO it is good that this check was introduced, allowing the bootstrap to fail fast.

If the check isn't performed, a new query for self is created, and the first time next() is called, the query state is immediately set to Finished. IMO it would be better to have an error message saying that the bootstrap failed because the routing table is empty, rather than looking for an empty OutboundQueryProgressed.

I'm not sure about that. If a check is needed when the routing table is empty, why is it not done for the other queries like get_closest_peers, get_record or get_providers ? In those cases, a Query is always started however it is possible no actual query is emitted on the wire. Sure get_record and get_providers can have information from their local cache, but get_closest_peers does not.

stormshield-frb avatar Jun 11 '24 12:06 stormshield-frb

I agree that the empty routing table check should be consistent with other queries, and it would even be better if the check was unified (e.g when creating the query?). But this can be left for a follow up PR.

guillaumemichel avatar Jun 11 '24 12:06 guillaumemichel

This pull request has merge conflicts. Could you please resolve them @stormshield-frb? 🙏

mergify[bot] avatar Jun 12 '24 18:06 mergify[bot]

@stormshield-frb what do you think of https://github.com/libp2p/rust-libp2p/pull/5349/commits/b548fc72f930182cf8c151b49d0eac43ae586bae? IMO it is a slightly cleaner, because if the routing table is empty, we simply reset the timers, and we don't need to increase and then decrease the count of bootstrap requests, and we don't need to wake the waker.

If you disagree, we can revert to the last commit.

guillaumemichel avatar Jun 18 '24 11:06 guillaumemichel