ExternalViewChangeListener not getting called randomly for big cluster size.

Open vaidhyanathan-ananthakrishnan-13 opened this issue 3 years ago • 6 comments

Describe the bug

We have a few hundred listeners registered as spectators for the Helix cluster, and more than a thousand participants in the cluster. When there is a change in the cluster, not all the listeners get called: sometimes only 5-10 are called, sometimes none, and most of the time all of the spectators are called. We register using ExternalViewChangeListener; the callback method is onExternalViewChange.

To Reproduce

Run a very large cluster; at that size the ExternalViewChangeListener callback is affected.

Expected behavior

All the spectators should get a callback for all the events.

Additional context

Spectators are unreliable because of this bug.

Hello @vaidhyanathan-ananthakrishnan-13 - can you elaborate on the size of the cluster, the number of resources, etc.? This will help us simulate the issue.

desaikomal, Apr 18 '22 14:04

@desaikomal Sorry, I closed the issue by mistake.

The cluster has ~2500 instances and ~1400 listeners. We have two kinds of listeners: one with ~1400 instances and another with ~100.

The ExternalViewChangeListener is the one with ~1400 instances, and that is the one where this issue is happening.

If I remember correctly, the same code base worked fine with the same cluster, but over time we had to add a lot of instances, and after a while we started seeing this error. I am not 100% sure whether it worked perfectly before, but nowadays we see a lot more issues.

@desaikomal

Is there a known issue for this case? Or are we expected to add some kind of fallback mechanism for this (I am not sure how to do that effectively without polling the state)?

Not that I am aware of. I am relatively new, so I am not sure whether others have seen this issue before.

desaikomal, Apr 20 '22 18:04

@vaidhyanathan-ananthakrishnan-13 Yes, this has been observed. More precisely, even though the cluster state has been updated, the corresponding updates to ExternalView are sometimes missing (the Helix controller fails to update ExternalView properly), which leads to ExternalViewChangeListener not being called. We have seen this in large-scale clusters.

We are tracking this issue, but feel free to add any findings or suggestions here.

narendly, Apr 20 '22 18:04

Thanks, @narendly,

I will try to find out why the ExternalView is not getting updated.

Meanwhile, as a workaround, is there any way to determine the cluster state deterministically from the spectator?

One thought I had is to use addLiveInstanceChangeListener and get the cluster state from the spectator, but I am not really sure how to get the cluster state without ExternalView. Any help here?

Or is there any known workaround for this ticket?
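
For readers looking for a stopgap, the polling fallback mentioned earlier in the thread could be sketched roughly as below. This is not a Helix API and not a fix confirmed by the maintainers. `ReconcilingWatcher`, `fetchCurrentView`, and `onChange` are hypothetical names, and the `Map<String, String>` (resource name to some comparable summary of its state) is an assumed stand-in for whatever the spectator actually reads. It also cannot help in the case described above where the controller never updates ExternalView at all, since there would be nothing new to read. It only covers notifications that are lost after the view was written.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Illustrative polling fallback for missed change notifications.
 * Nothing here is a Helix class: the Supplier stands in for a read of the
 * current view, and the Consumer stands in for the logic that the normal
 * listener callback would run.
 */
final class ReconcilingWatcher {
    private final Supplier<Map<String, String>> fetchCurrentView; // hypothetical state read
    private final Consumer<Map<String, String>> onChange;         // hypothetical change handler
    private Map<String, String> lastSeen = new HashMap<>();

    ReconcilingWatcher(Supplier<Map<String, String>> fetchCurrentView,
                       Consumer<Map<String, String>> onChange) {
        this.fetchCurrentView = fetchCurrentView;
        this.onChange = onChange;
    }

    /** Call from the normal listener callback so polling does not re-deliver the same view. */
    synchronized void recordDelivered(Map<String, String> view) {
        lastSeen = new HashMap<>(view);
    }

    /**
     * Reads the current view and invokes the handler only if it differs from
     * the last view seen. Returns true when a missed change was detected.
     */
    synchronized boolean reconcileOnce() {
        Map<String, String> current = new HashMap<>(fetchCurrentView.get());
        if (current.equals(lastSeen)) {
            return false;
        }
        lastSeen = current;
        onChange.accept(Collections.unmodifiableMap(current));
        return true;
    }
}
```

`reconcileOnce()` could be scheduled at a fixed delay, for example with a `ScheduledExecutorService`. With ~1400 spectators the interval would need to be chosen carefully, and probably jittered, to avoid adding read load to ZooKeeper.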