Properly reflect rollbacks/restores in target tables

Open the-other-tim-brown opened this issue 2 years ago • 9 comments

Right now when we see a rollback or restore in the source table, we just treat it as files being removed from the table. We should update this to instead issue a rollback command in the target tables so that the histories are more consistent between the source and target.

the-other-tim-brown, Sep 27 '23 13:09

What are the source and target in this scenario? I am trying to understand the code change, so I am following the Jira issues as well and might ask a silly question.

gzagarwal, Mar 15 '24 16:03

@gzagarwal The idea here is that the source can be any of the supported sources and the target can be any of the supported targets. The vision was that a rollback/restore to a previous point in time or commit would trigger the same operation in the target format if possible, falling back to the current behavior of computing files to add to or remove from the target format's view.
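
As a rough illustration of that "native rollback if possible, otherwise fall back" idea, here is a minimal Java sketch. The `TargetTable` interface, its two methods, and `RollbackSync` are hypothetical names made up for this example; they are not XTable's actual API.

```java
// Hypothetical interface for illustration only; not XTable's actual API.
interface TargetTable {
    // Attempts a native rollback/restore; returns false if the target cannot do it.
    boolean tryRestoreTo(String commitId);

    // Current behavior: express the rollback as computed file adds/removes.
    void applyFileDiff(String commitId);
}

final class RollbackSync {
    enum Outcome { NATIVE_RESTORE, FILE_DIFF_FALLBACK }

    // Prefer the target's own rollback so histories stay aligned;
    // otherwise keep today's add/remove handling.
    static Outcome syncRollback(TargetTable target, String commitId) {
        if (target.tryRestoreTo(commitId)) {
            return Outcome.NATIVE_RESTORE;
        }
        target.applyFileDiff(commitId);
        return Outcome.FILE_DIFF_FALLBACK;
    }
}
```

The point of the shape is only that the fallback path stays identical to the existing behavior, so a target that cannot restore natively loses nothing.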

the-other-tim-brown, Mar 25 '24 15:03

Hi @the-other-tim-brown @gzagarwal I’m interested in working on this feature :)

danielhumanmod, Oct 14 '24 09:10

After some initial investigation, I also have a question about how to handle the source table.

Based on my understanding, our sync() process is externally controlled (for example, time-based or event-driven), so each sync might capture multiple operations on the table. For formats like Iceberg (via snapshot ID), detecting a rollback is straightforward. With Delta and Hudi, however, it is more complex. Delta relies on a log-based system and Hudi on a time-based model, and both may record several operations (commit, add, delete, rollback/restore) between syncs. This makes rollback detection more difficult, especially if multiple operations have occurred since the rollback. In such cases, when mixed operation types are involved, maybe we still want to treat the changes as simple add/delete operations, as we do now?
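
To make the mixed-operations question concrete, here is a minimal Java sketch of one possible decision rule. Everything in it is an assumption for illustration: `OpType`, `SyncMode`, and `PendingOps` are hypothetical names, and the rule "a lone rollback is handled natively, anything mixed falls back to add/delete" is only one reading of the idea above, not existing XTable logic.

```java
import java.util.List;

// Hypothetical operation kinds seen in the source between two syncs.
enum OpType { COMMIT, ADD, DELETE, ROLLBACK }

// Hypothetical ways a single sync could be carried out.
enum SyncMode { REGULAR, NATIVE_ROLLBACK, FILE_DIFF }

final class PendingOps {
    // Chooses how to sync based on what happened since the last sync.
    static SyncMode choose(List<OpType> opsSinceLastSync) {
        if (!opsSinceLastSync.contains(OpType.ROLLBACK)) {
            return SyncMode.REGULAR; // no rollback involved, nothing special to do
        }
        // Assumed rule: only a single, isolated rollback is mirrored natively.
        boolean rollbackOnly = opsSinceLastSync.size() == 1;
        return rollbackOnly ? SyncMode.NATIVE_ROLLBACK : SyncMode.FILE_DIFF;
    }
}
```

Under this rule, `choose` returns `FILE_DIFF` for a batch such as commit, rollback, add, which is the case where detection gets hard.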

These are just my initial thoughts based on my investigation, and I may be missing something. I would appreciate any suggestions or input you might have!

danielhumanmod, Oct 14 '24 09:10

@danielhumanmod What you've described is how we're currently handling rollbacks/restores, but I think it may be less computationally expensive to restore the table to a particular point in time instead of computing a large diff against the current state of the table.
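
For context, the diff-based handling can be pictured as two set differences between file listings. The sketch below is a simplified illustration, assuming files are identified by plain path strings; `FileDiff` is a hypothetical class, not the code XTable uses.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder for the result of a diff-based rollback sync.
final class FileDiff {
    final Set<String> toAdd;
    final Set<String> toRemove;

    private FileDiff(Set<String> toAdd, Set<String> toRemove) {
        this.toAdd = toAdd;
        this.toRemove = toRemove;
    }

    // Compares what the target currently tracks with the source's
    // file listing after the rollback.
    static FileDiff between(Set<String> targetFiles, Set<String> sourceFilesAfterRollback) {
        Set<String> toAdd = new HashSet<>(sourceFilesAfterRollback);
        toAdd.removeAll(targetFiles); // in source but missing from target

        Set<String> toRemove = new HashSet<>(targetFiles);
        toRemove.removeAll(sourceFilesAfterRollback); // in target but gone from source

        return new FileDiff(toAdd, toRemove);
    }
}
```

Both listings have to be materialized and compared, so the work grows with the number of files in the table. That is the cost a restore to a known point in time would aim to avoid.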

the-other-tim-brown, Oct 16 '24 01:10

Thanks for the clarification, @the-other-tim-brown! Based on the discussion, my current idea is:

  1. Iceberg: Identify the current snapshot ID, and if a rollback is detected, directly restore to that version.
  2. Delta: Trace the commit logs to find the latest restore operation, restore to that version, and then apply any remaining changes (e.g., adds/removals after the restore).
  3. Hudi: Similar to Delta, identify the latest rollback, restore to that version, and apply the remaining changes.
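
The Delta and Hudi steps above share one shape: find the latest restore among the entries since the last sync, restore to its version, then replay whatever came after it. Here is a minimal Java sketch of that planning step. `LogEntry`, `RestorePlan`, and `RestorePlanner` are hypothetical types, and the sketch assumes each restore entry records the version it restored to and that this version maps to a state the target already has; neither assumption is confirmed for any specific format here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of one entry in a source table's log or timeline.
final class LogEntry {
    final long version;
    final boolean isRestore;
    final long restoredToVersion; // only meaningful when isRestore is true

    private LogEntry(long version, boolean isRestore, long restoredToVersion) {
        this.version = version;
        this.isRestore = isRestore;
        this.restoredToVersion = restoredToVersion;
    }

    static LogEntry change(long version) {
        return new LogEntry(version, false, -1L);
    }

    static LogEntry restore(long version, long restoredToVersion) {
        return new LogEntry(version, true, restoredToVersion);
    }
}

// Hypothetical result: where to restore to, and which later entries to replay.
final class RestorePlan {
    static final long NO_RESTORE = -1L;
    final long restoreToVersion;
    final List<LogEntry> changesToReplay;

    RestorePlan(long restoreToVersion, List<LogEntry> changesToReplay) {
        this.restoreToVersion = restoreToVersion;
        this.changesToReplay = changesToReplay;
    }
}

final class RestorePlanner {
    // Entries must be ordered oldest to newest.
    static RestorePlan plan(List<LogEntry> entriesSinceLastSync) {
        int latestRestore = -1;
        for (int i = entriesSinceLastSync.size() - 1; i >= 0; i--) {
            if (entriesSinceLastSync.get(i).isRestore) {
                latestRestore = i;
                break;
            }
        }
        if (latestRestore < 0) {
            // No restore seen: every entry is replayed as a regular change.
            return new RestorePlan(RestorePlan.NO_RESTORE, new ArrayList<>(entriesSinceLastSync));
        }
        long restoreTo = entriesSinceLastSync.get(latestRestore).restoredToVersion;
        List<LogEntry> after = new ArrayList<>(
            entriesSinceLastSync.subList(latestRestore + 1, entriesSinceLastSync.size()));
        return new RestorePlan(restoreTo, after);
    }
}
```

Entries before the latest restore are dropped from the plan because the restore supersedes them. For Iceberg, the list would typically be empty after the restore, which matches the simpler "restore to the snapshot" step.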

Does this approach align with your thoughts?

danielhumanmod, Oct 16 '24 02:10

Yes, it does.

the-other-tim-brown, Oct 21 '24 00:10

Hi @the-other-tim-brown, based on the idea we discussed above, my plan is to divide this feature into two PRs:

  1. Detect rollback in the source and target formats (Draft PR #569)
  2. Sync logic (includes snapshot-based sync and incremental sync)

I’ve completed a proof of concept for the first part and would like to discuss a few points with you before proceeding with further implementation. My main concern is that the fallback might happen frequently in cases where the source and target are not synced often. I’ve explained the root cause and included an example in the PR. Could you review the high-level idea in #569 and let me know if this approach is acceptable to you?

danielhumanmod, Oct 28 '24 02:10

@danielhumanmod I will take a look today or tomorrow. Apologies for the delay on my end.

the-other-tim-brown, Nov 02 '24 12:11