Support Snapshot Management Operations
Feature Request / Improvement
Following is a list of operations that are supported in Spark:
- rollback_to_snapshot (set_ref_snapshot)
- rollback_to_timestamp (set_ref_snapshot)
- set_current_snapshot (set_ref_snapshot)
- cherrypick_snapshot
- publish_changes
- fast_forward
set_ref_snapshot support will be introduced, but it would also be nice to add snapshot migration operations that generate new snapshots from existing ones
@syun64 I have a few questions about the operations and I couldn't find more info in the docs. Apologies if these have been answered elsewhere.
-
How does rollback_to_timestamp use
set_ref_snapshot()? In the rollback_to_timestamp documentation, inputs are Table and timestamp, and I can't find asnapshot_by_timestamp()api to get the snapshot_id. -
for cherrypick_snapshot and publish_changes (docs), wouldn't we need an
add_snapshot()table api ? I noticed bothadd_snapshot()andset_ref_snapshot()were removed in the same PR. Do we bring backadd_snapshot()as well?
- How does rollback_to_timestamp use
set_ref_snapshot()? In the rollback_to_timestamp documentation, inputs are Table and timestamp, and I can't find asnapshot_by_timestamp()api to get the snapshot_id.
- Yeah I think it would be helpful to introduce a
snapshot_by_timestamputility function to get the snapshot - just like you mentioned, that would help recover feature parity with the existing Java API
I don't think add_snapshot needs to be an API because the function is incredibly simple, in that it just adds a AddSnapshotUpdate table update and AssertTableUUID requirement. I think instead, functions like cherrypick_snapshot, publish_changes should be separate APIs that builds the new snapshot and then makes a commit with the updated snapshot. WDYT?
Yeah I think it would be helpful to introduce a snapshot_by_timestamp utility function to get the snapshot - just like you mentioned, that would help recover feature parity with the existing Java API
Yeah, looking at the Java api, something like findLatestAncestorOlderThan makes sense.
WDYT?
Sounds good to me!
Hi @syun64 @chinmay-bhat. Thank you for driving these features. I saw some PRs raised by @chinmay-bhat : #728, #748, #750, #758. Those are great!
I would like to discuss what APIs we want to expose to users. I am actually leaning toward removing set_ref_snapshot from public API. The reason is that set_ref_snapshot itself requires too many details from users and thus error-prone. It shall sit behind APIs like rollback_to_snapshot, create_tag, cherryPick which have more specific functionality and simpler to use.
We may also want to group these APIs into a class (e.g. ManageSnapshots), like the java implementation. So the resulting user experience with these APIs will be
with table.transaction() as txn:
... other updates
with txn.manage_snapshots() as ms:
ms.createTag("Tag_A", 0)
ms.createTag("Tag_B", 1)
We could also keep table.manage_snapshots() for convenience. Similar to the usage of UpdateSchema.
What do you think about this? Appreciate any thoughts and suggestions!
Also kindly looping @Fokko to this thread.
@HonahX thank you for your response.
I agree that we should hide the set_ref_snapshot from the public API. I also like the idea of creating a ManageSnapshots inner class in Transaction to organise the APIs, while exposing them through Transaction and Table.