ExternalViewChangeListener not getting called randomly for big cluster size.
Describe the bug
We have a few hundred listeners registered as spectators on a Helix cluster with more than a thousand participants. When the cluster changes, not all of the listeners are called: sometimes only 5-10 are called, sometimes none, and most of the time all of them are. We register an ExternalViewChangeListener; the callback method is onExternalViewChange.
To Reproduce
A very large cluster size appears to affect the ExternalViewChangeListener callback.
Expected behavior
All the spectators should get a callback for all the events.
Additional context
Spectators are unreliable because of this bug.
Hello @vaidhyanathan-ananthakrishnan-13 - can you elaborate on the size of the cluster, the number of resources, etc.? This will help us simulate the problem.
@desaikomal Sorry, I closed the issue by mistake.
The cluster has ~2500 instances and ~1400 listeners. We have two kinds of listeners: one has ~1400 instances and the other has ~100.
The ExternalViewChangeListener is the one with ~1400 instances, and that is where this issue happens.
If I remember correctly, the same code base worked fine with the same cluster, but over time we had to add a lot of instances, and after a while we started seeing this error. I am not 100% sure whether it worked perfectly before, but nowadays we see many more issues.
@desaikomal
Is there a known issue for this case? Or are we expected to add some kind of fallback mechanism for this (I am not sure how to do that effectively without polling the state)?
Not that I am aware of. I am relatively new, so I am not sure whether others have seen this issue before.
@vaidhyanathan-ananthakrishnan-13 Yes, this has been observed. More precisely, even though the cluster state has been updated, the corresponding updates to ExternalView are sometimes missing (the Helix controller fails to update ExternalView properly), and as a result ExternalViewChangeListener is not called. We have seen this in large-scale clusters.
We are tracking this issue, but feel free to add any findings or suggestions here.
Thanks, @narendly,
I will try to find out why the ExternalView is not getting updated.
Meanwhile, as a workaround, is there any way to determine the cluster state deterministically from the spectator?
One thought I had is to register a listener with addLiveInstanceChangeListener and fetch the cluster state from the spectator, but I am not sure how to get the cluster state without ExternalView. Any help here?
Or is there a known workaround for this ticket?
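
For the fallback question, here is a minimal sketch of a periodic reconciliation step that a spectator could run next to the listener. It is not a Helix API: `SnapshotReconciler`, `reader`, and `onChange` are hypothetical names. The `reader` is assumed to re-read the current view on demand (in a spectator this might go through the manager's data accessor) and to return a fresh, immutable snapshot each time. It only covers notifications that never reached the spectator; if the controller never writes the ExternalView update in the first place, re-reading ExternalView will not reveal the change.

```java
import java.util.Objects;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Helix): remembers the last snapshot that
 * was handled and fires the handler only when a newer snapshot differs.
 */
final class SnapshotReconciler<S> {
    private final Supplier<S> reader;   // assumption: re-reads the current view on demand
    private final Consumer<S> onChange; // the same logic the listener callback would run
    private S lastSeen;                 // last snapshot passed to onChange, null initially

    SnapshotReconciler(Supplier<S> reader, Consumer<S> onChange) {
        this.reader = Objects.requireNonNull(reader);
        this.onChange = Objects.requireNonNull(onChange);
    }

    /** Call from the listener callback so the timer does not deliver the same state twice. */
    synchronized void accept(S snapshot) {
        if (!Objects.equals(snapshot, lastSeen)) {
            lastSeen = snapshot;
            onChange.accept(snapshot);
        }
    }

    /** Call from a timer; returns true if a change was found that no callback delivered. */
    synchronized boolean reconcile() {
        S current = reader.get();
        if (Objects.equals(current, lastSeen)) {
            return false;
        }
        lastSeen = current;
        onChange.accept(current);
        return true;
    }
}
```

A spectator could schedule `reconcile()` with a `ScheduledExecutorService` at a fixed delay, and could also call it from a live-instance callback as suggested above. With ~1400 spectators, the polling interval would need to be chosen with the read load on ZooKeeper in mind.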