Only leader drops outdated nodes
Description
Fixes #18078 /nocl
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Refactoring (non-breaking change)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
- [x] My code follows the code style of this project.
- [ ] My change requires a change to the documentation.
- [ ] I have updated the documentation accordingly.
- [x] I have read the CONTRIBUTING document.
- [ ] I have added tests to cover my changes.
@luk-kaminski Please test carefully with a Graylog cluster whether this has any negative consequences when there is no leader for about one minute and nodes disappear in that time frame. There might be errors regarding metrics requests and maybe others.
Hi @bernd,
I understand that if only the leader is responsible for removing nodes from the nodes collection, the entries for old nodes may stay in that collection longer in some circumstances.
But that leads to another question: what kind of problem does it cause if an entry for an old node stays in that collection longer?
Why I am asking: there is an explicit config setting, called stale_leader_timeout, that allows prolonging that delay, i.e. keeping entries for old nodes in the collection longer. If it is a problem, should we allow it to be changed or increased? If it is not a problem, then should we bother about entries staying in the collection for around 60 seconds?
@luk-kaminski
> But that leads to another question: what kind of problem does it cause if an entry for an old node stays in that collection longer?
That's what we need to find out before we change the node expiration mechanism. :slightly_smiling_face:
> Why I am asking: there is an explicit config setting, called stale_leader_timeout, that allows prolonging that delay, i.e. keeping entries for old nodes in the collection longer. If it is a problem, should we allow it to be changed or increased? If it is not a problem, then should we bother about entries staying in the collection for around 60 seconds?
The stale_leader_timeout config option has a misleading name. In the AbstractNodeService, the field is called pingTimeout. That would have been a better name for the config option. :smile:
The stale_leader_timeout is the number of milliseconds we wait until we remove a node from MongoDB's nodes collection. It's unrelated to the MongoDB TTL index that we use for leader elections.
We never really changed the default value of the ping timeout (stale_leader_timeout). That's why we are super careful about touching the node expiration mechanism without more research and testing.
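To make the mechanism under discussion concrete, here is a minimal sketch of ping-timeout-based node expiration where only the leader removes stale entries. It is not the actual AbstractNodeService code: the class `NodeRegistrySketch`, its methods, and the in-memory map standing in for the MongoDB nodes collection are all hypothetical, and the timeout is passed in explicitly rather than read from stale_leader_timeout.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical in-memory stand-in for the nodes collection; not Graylog's real classes.
final class NodeRegistrySketch {
    // node id -> timestamp (in ms) of the node's last ping
    private final Map<String, Long> lastSeenMillis = new HashMap<>();
    // Assumed to play the role of pingTimeout / stale_leader_timeout (milliseconds).
    private final long pingTimeoutMillis;

    NodeRegistrySketch(long pingTimeoutMillis) {
        this.pingTimeoutMillis = pingTimeoutMillis;
    }

    // Records a ping from a node at the given time.
    void markAsAlive(String nodeId, long nowMillis) {
        lastSeenMillis.put(nodeId, nowMillis);
    }

    boolean isKnown(String nodeId) {
        return lastSeenMillis.containsKey(nodeId);
    }

    // Removes entries whose last ping is older than the timeout.
    // Non-leader nodes skip the cleanup entirely and return 0.
    int dropOutdated(boolean isLeader, long nowMillis) {
        if (!isLeader) {
            return 0;
        }
        int dropped = 0;
        Iterator<Map.Entry<String, Long>> it = lastSeenMillis.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > pingTimeoutMillis) {
                it.remove();
                dropped++;
            }
        }
        return dropped;
    }
}
```

With this shape, a stale entry survives for as long as no leader runs the cleanup, which is exactly the window that needs testing.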
@bernd - we have clients that have started changing stale_leader_timeout by themselves. Let's have this discussion on Slack instead of GH.