Helix behavior change since 1.4.0

Open Jackie-Jiang opened this issue 10 months ago • 10 comments

Describe the bug

Helix behaves differently after upgrading from 1.3.1: when a partition is set to OFFLINE (the initial state) in the ideal state, it no longer shows up in the instance's current state, and therefore not in the external view either. With 1.3.1, I can find the following entry in the current state ZNRecord:

    "mytable__1__1__20250327T2130Z": {
      "CURRENT_STATE": "OFFLINE"
    },

This entry is no longer there after upgrading Helix to 1.4.3. I suspect it is related to #2772, but I haven't found the exact code change.

To Reproduce

With custom rebalance mode, create a partition with OFFLINE state, then start the instance.

Expected behavior

Same behavior as 1.3.1

Jackie-Jiang avatar Mar 27 '25 21:03 Jackie-Jiang

@zpinto @junkaixue Can you help take a look? Thanks!

Jackie-Jiang avatar Mar 27 '25 21:03 Jackie-Jiang

Any update on this?

Jackie-Jiang avatar Apr 03 '25 18:04 Jackie-Jiang

Will have a check.

junkaixue avatar Apr 03 '25 20:04 junkaixue

@Jackie-Jiang Wait a minute. Are you sure it had a CurrentState entry before?

Here's the thing: if you directly add a partition with the "OFFLINE" state, there is no OFFLINE -> OFFLINE state transition. Helix won't send that kind of message. If there is no state transition, you definitely won't see the partition in the current state.

I guess the scenario is like:

  1. Pinot adds a partition with ONLINE as the target state.
  2. It starts bootstrapping, and the triggered state transition updates the current state.
  3. Then you mark it as OFFLINE.

In this case, the current state and the external view will have it.
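
The reasoning in this comment can be sketched as a toy model. This is not Helix code: the class `TransitionModelSketch` and its methods are hypothetical, and whether 1.3.1 actually behaved this way is exactly what the thread goes on to debate. The model assumes a current-state entry is written only when a transition message is processed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model (not Helix code) of the reasoning above: a current-state entry is
// written only when a state transition message is actually processed.
class TransitionModelSketch {
    static final String INITIAL_STATE = "OFFLINE";

    // partition -> state recorded by the participant
    private final Map<String, String> currentState = new HashMap<>();

    // Returns the transition "FROM->TO" the controller would send, if any.
    Optional<String> computeTransition(String partition, String targetState) {
        String from = currentState.getOrDefault(partition, INITIAL_STATE);
        if (from.equals(targetState)) {
            return Optional.empty(); // e.g. OFFLINE -> OFFLINE: nothing to send
        }
        return Optional.of(from + "->" + targetState);
    }

    // Applies the target state; records a current state only if a transition ran.
    void apply(String partition, String targetState) {
        if (computeTransition(partition, targetState).isPresent()) {
            currentState.put(partition, targetState);
        }
    }

    boolean hasCurrentState(String partition) {
        return currentState.containsKey(partition);
    }

    public static void main(String[] args) {
        TransitionModelSketch model = new TransitionModelSketch();
        model.apply("p0", "OFFLINE"); // partition added directly as OFFLINE
        System.out.println("p0 has current state: " + model.hasCurrentState("p0"));
        model.apply("p1", "ONLINE");  // steps 1 and 2
        model.apply("p1", "OFFLINE"); // step 3
        System.out.println("p1 has current state: " + model.hasCurrentState("p1"));
    }
}
```

Under this model, a partition added directly as OFFLINE never gets an entry, while one that went ONLINE first keeps its entry after being marked OFFLINE.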

junkaixue avatar Apr 03 '25 21:04 junkaixue

@junkaixue Yes, it has a current state even if no state transition is needed. Notice that there is no previous state in the current state.

I actually went through the same reasoning. When I follow the steps you mentioned, it shows the current state as OFFLINE and the previous state as ONLINE. When I remove the instance and add it back, or directly add an OFFLINE partition, it shows the current state as OFFLINE with no previous state. With the new version, when I change ONLINE -> OFFLINE, the current state sometimes exists as in the old version and sometimes does not. When I directly add OFFLINE, the current state doesn't exist.

Jackie-Jiang avatar Apr 04 '25 19:04 Jackie-Jiang

@junkaixue +1 to @Jackie-Jiang. Based on what I've observed so far, the EV entry disappears if the instance is gone (e.g. deleted from LIVEINSTANCES); otherwise it seems to be present and set to OFFLINE (I need to experiment more to see whether it is always set to OFFLINE or sometimes missing). The earlier behavior was that the EV always had an entry.

Is the above behavior intended or is this an issue that needs to be fixed?

somandal avatar Apr 10 '25 17:04 somandal

@junkaixue @zpinto It also looks like there is a behavior change regarding DROPPED state transitions. Could this be related to the empty CURRENTSTATE / EV problem this issue was originally opened for?

We have a scenario where a state transition throws an exception, so the partition goes into the ERROR state. After that, we see that partition get an ERROR -> DROPPED state transition:

2025-04-15T16:52:40.1170674Z 16:52:39.924 ERROR [Server_localhost_22001 - SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread_78] SegmentOnlineOfflineStateModel.onBecomeDroppedFromError() : ZnRecord=661f7f6c-5c85-440e-86c5-350adf2ee5ca, {CREATE_TIMESTAMP=1744735959917, ClusterEventName=PeriodicalRebalance, EXECUTE_START_TIMESTAMP=1744735959924, EXE_SESSION_ID=10000195d57000b, FROM_STATE=ERROR, MSG_ID=661f7f6c-5c85-440e-86c5-350adf2ee5ca, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=mytable__0__0__20250415T1643Z, READ_TIMESTAMP=1744735959921, RESOURCE_NAME=mytable_REALTIME, RESOURCE_TAG=mytable_REALTIME, RETRY_COUNT=3, SRC_NAME=localhost_20000, SRC_SESSION_ID=10000195d570003, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_localhost_22001, TGT_SESSION_ID=10000195d57000b, TO_STATE=DROPPED}{}{}, Stat=Stat {_version=0, _creationTime=1744735959918, _modifiedTime=1744735959918, _ephemeralOwner=0}

On debugging I do see it hit this new code:

      // Look through the current state map and add DROPPED message if the instance is not in the
      // resourceStateMap. This instance may not have had been dropped by the rebalance strategy.
      // This check is required to ensure that the instances removed from the ideal state stateMap
      // are properly dropped.
      for (String instance : currentStateMap.keySet()) {
        if (!instanceStateMap.containsKey(instance)) {
          instanceStateMap.put(instance, HelixDefinedState.DROPPED.name());
        }
      }

I've attached some screenshots of the fields in the debugger. In Helix 1.3.1, I see the same scenario in terms of the instanceStateMap being empty for that partition, current state having an entry as ERROR, but a DROPPED transition is never sent since the above code doesn't exist.
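
As a stand-alone illustration of what the quoted loop does in this scenario, here is a minimal sketch using plain maps. It is not the Helix implementation: the class `DroppedFillSketch` and the method `fillDropped` are hypothetical, the string `"DROPPED"` stands in for `HelixDefinedState.DROPPED.name()`, and the instance name is borrowed from the log line above.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone illustration of the loop quoted above, using plain maps.
// The string "DROPPED" stands in for HelixDefinedState.DROPPED.name().
class DroppedFillSketch {
    static final String DROPPED = "DROPPED";

    // Mirrors the quoted loop: any instance that has a current state but no
    // entry in the computed target map gets a DROPPED target.
    static Map<String, String> fillDropped(Map<String, String> currentStateMap,
                                           Map<String, String> instanceStateMap) {
        Map<String, String> result = new HashMap<>(instanceStateMap);
        for (String instance : currentStateMap.keySet()) {
            if (!result.containsKey(instance)) {
                result.put(instance, DROPPED);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Current state holds ERROR; the computed target map is empty,
        // matching what the debugger showed for this partition.
        Map<String, String> current = Map.of("Server_localhost_22001", "ERROR");
        Map<String, String> target = Map.of();
        System.out.println(fillDropped(current, target));
    }
}
```

With an ERROR entry in the current state and an empty target map, the instance ends up with a DROPPED target, which is consistent with the ERROR -> DROPPED message in the log.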

Can you folks elaborate more on this behavior change and why? Is this a bug or intended? thanks!

[Three screenshots of the fields in the debugger]

somandal avatar Apr 15 '25 21:04 somandal

@junkaixue @zpinto More interesting findings based on my last comment. I performed the following steps on a CUSTOMIZED resource:

  • Forced some partitions into the ERROR state in the EV by throwing an exception in the StateTransition callback
  • Updated the IdealState of an ONLINE segment to OFFLINE, to trigger a rebalance loop via an IS change
  • All the ERROR partitions got dropped; they don't come back as ERROR in the EV even after waiting a while
  • All the instances these ERROR partitions are assigned to are still up and running correctly

Maybe I'm misunderstanding the expected behavior, but why are ERROR segments deleted in this scenario and what are the implications on how we should handle and identify such ERROR cases? If we find a partition missing in EV, does this now mean that it might:

  1. Be OFFLINE (confirm with IS that OFFLINE is the expected state - treat this as a no-action) (OFFLINE is our initialState)
  2. Be ERROR (IS is non-OFFLINE)
  3. Not yet be added to the cluster / in the middle of processing the state transition (IS is non-OFFLINE)

Especially, how can we differentiate between 2 and 3 above?
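
To make the ambiguity concrete, the three cases can be written as a small lookup. This is a hypothetical helper, not a Helix or Pinot API: the class `MissingPartitionSketch`, the enum `Cause`, and the method `possibleCauses` are made up for illustration, and the only input it assumes is the ideal-state value for a partition that is absent from the EV.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper (not a Helix or Pinot API) encoding the three cases
// listed above for a partition that is missing from the external view.
class MissingPartitionSketch {
    enum Cause { OFFLINE_NO_ACTION, ERROR_DROPPED, NOT_YET_PROCESSED }

    // Given only the ideal-state value, returns every cause that is still possible.
    static Set<Cause> possibleCauses(String idealState) {
        if ("OFFLINE".equals(idealState)) {
            return EnumSet.of(Cause.OFFLINE_NO_ACTION); // case 1
        }
        // Cases 2 and 3: the ideal state alone cannot tell them apart.
        return EnumSet.of(Cause.ERROR_DROPPED, Cause.NOT_YET_PROCESSED);
    }

    public static void main(String[] args) {
        System.out.println(possibleCauses("OFFLINE"));
        System.out.println(possibleCauses("ONLINE"));
    }
}
```

For a non-OFFLINE ideal state the helper can only return both remaining causes, which is the gap the question is pointing at.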

Example state transition message received for DROPPING the ERROR segment:

2025/04/16 09:22:36.878 ERROR [38_7050 - SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread_85] SegmentOnlineOfflineStateModel.onBecomeDroppedFromError() : ZnRecord=abfa46ea-f2d4-4d96-a2a4-734fc8b8e2b5, {CREATE_TIMESTAMP=1744820556842, ClusterEventName=IdealStateChange, EXECUTE_START_TIMESTAMP=1744820556877, EXE_SESSION_ID=10052d688ae0018, FROM_STATE=ERROR, MSG_ID=abfa46ea-f2d4-4d96-a2a4-734fc8b8e2b5, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=airlineStats_OFFLINE_16081_16081_0, READ_TIMESTAMP=1744820556855, RESOURCE_NAME=airlineStats_OFFLINE, RESOURCE_TAG=airlineStats_OFFLINE, RETRY_COUNT=3, SRC_NAME=100.79.216.38_9000, SRC_SESSION_ID=10052d688ae0005, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_100.79.216.38_7050, TGT_SESSION_ID=10052d688ae0018, TO_STATE=DROPPED}{}{}, Stat=Stat {_version=0, _creationTime=1744820556844, _modifiedTime=1744820556844, _ephemeralOwner=0}

cc @Jackie-Jiang

Update:

  • If I modify the IS again to say move the segment from OFFLINE back to ONLINE, I get state transitions for all the segments that had been dropped too, to become ONLINE

Example state transition message received for a previously (ERROR -> DROPPED) partition and now moving to ONLINE:

2025/04/16 09:49:30.579 ERROR [38_7050 - SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread_99] SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline() : ZnRecord=35ba8fc7-10ec-448b-8c7d-b3fa94226450, {CREATE_TIMESTAMP=1744822170542, ClusterEventName=IdealStateChange, EXECUTE_START_TIMESTAMP=1744822170579, EXE_SESSION_ID=10052da59990018, FROM_STATE=OFFLINE, MSG_ID=35ba8fc7-10ec-448b-8c7d-b3fa94226450, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=airlineStats_OFFLINE_16081_16081_0, READ_TIMESTAMP=1744822170562, RESOURCE_NAME=airlineStats_OFFLINE, RESOURCE_TAG=airlineStats_OFFLINE, RETRY_COUNT=3, SRC_NAME=100.79.216.38_9000, SRC_SESSION_ID=10052da59990005, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_100.79.216.38_7050, TGT_SESSION_ID=10052da59990018, TO_STATE=ONLINE}{}{}, Stat=Stat {_version=0, _creationTime=1744822170543, _modifiedTime=1744822170543, _ephemeralOwner=0}

Even based on the above, my earlier questions still hold. Say we don't update the IS for a while, we cannot detect ERROR segments until the next update. Why are ERROR segments dropped in the first place?

somandal avatar Apr 16 '25 16:04 somandal

I had an offline sync with @somandal regarding the partition OFFLINE state and tried to reproduce the behavior on both 1.3.1 and 1.4.3. I actually see the same behavior on both:

  1. Adding a partition in the OFFLINE state in the IS won't show up in the EV, for both versions.
  2. A partition is OFFLINE on instance_1; stop instance_1 and all of its states in the EV are gone. Reconnect instance_1 and the states are back. The behavior is the same for both versions. (See the screenshot.)
Image
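
Observation 2 is what one would expect if the external view is aggregated from the current states of live instances only. The sketch below is a toy model under that assumption, not Helix code; the class `ExternalViewSketch`, the method `externalView`, and the partition name `p0` are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model (not Helix code): the external view is assumed to be aggregated
// from the current states of live instances only.
class ExternalViewSketch {
    // currentStates: instance -> (partition -> state)
    // result:        partition -> (instance -> state)
    static Map<String, Map<String, String>> externalView(
            Map<String, Map<String, String>> currentStates, Set<String> liveInstances) {
        Map<String, Map<String, String>> ev = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> byInstance : currentStates.entrySet()) {
            String instance = byInstance.getKey();
            if (!liveInstances.contains(instance)) {
                continue; // a stopped instance contributes nothing to the view
            }
            for (Map.Entry<String, String> ps : byInstance.getValue().entrySet()) {
                ev.computeIfAbsent(ps.getKey(), k -> new TreeMap<>())
                  .put(instance, ps.getValue());
            }
        }
        return ev;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> current =
            Map.of("instance_1", Map.of("p0", "OFFLINE"));
        System.out.println(externalView(current, Set.of()));             // instance stopped
        System.out.println(externalView(current, Set.of("instance_1"))); // instance reconnected
    }
}
```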

I will start looking at the ERROR -> DROPPED behavior.

xyuanlu avatar Apr 21 '25 18:04 xyuanlu

hey @xyuanlu any update on when a new release is available with the fix? or if https://github.com/apache/helix/pull/2976 can be hotfixed onto a 1.3 version so we can pick that up instead of trying to upgrade to 1.4.+? thanks!

somandal avatar May 07 '25 23:05 somandal