iceberg-python Support Snapshot Management Operations

Feature Request / Improvement

Following is a list of operations that are supported in Spark:

rollback_to_snapshot (set_ref_snapshot)
rollback_to_timestamp (set_ref_snapshot)
set_current_snapshot (set_ref_snapshot)
cherrypick_snapshot
publish_changes
fast_forward

set_ref_snapshot support will be introduced, but it would also be nice to add snapshot migration operations that generate new snapshots from existing ones

May 14 '24 16:05 sungwy

@syun64 I have a few questions about the operations and I couldn't find more info in the docs. Apologies if these have been answered elsewhere.

How does rollback_to_timestamp use set_ref_snapshot()? In the rollback_to_timestamp documentation, inputs are Table and timestamp, and I can't find a snapshot_by_timestamp() api to get the snapshot_id.
for cherrypick_snapshot and publish_changes (docs), wouldn't we need an add_snapshot() table api ? I noticed both add_snapshot() and set_ref_snapshot() were removed in the same PR. Do we bring back add_snapshot() as well?

May 15 '24 17:05 chinmay-bhat

How does rollback_to_timestamp use set_ref_snapshot()? In the rollback_to_timestamp documentation, inputs are Table and timestamp, and I can't find a snapshot_by_timestamp() api to get the snapshot_id.

Yeah I think it would be helpful to introduce a snapshot_by_timestamp utility function to get the snapshot - just like you mentioned, that would help recover feature parity with the existing Java API

for cherrypick_snapshot and publish_changes (docs), wouldn't we need an add_snapshot() table api ? I noticed both add_snapshot() and set_ref_snapshot() were removed in the same PR. Do we bring back add_snapshot() as well?

I don't think add_snapshot needs to be an API because the function is incredibly simple, in that it just adds a AddSnapshotUpdate table update and AssertTableUUID requirement. I think instead, functions like cherrypick_snapshot, publish_changes should be separate APIs that builds the new snapshot and then makes a commit with the updated snapshot. WDYT?

May 15 '24 17:05 sungwy

Yeah I think it would be helpful to introduce a snapshot_by_timestamp utility function to get the snapshot - just like you mentioned, that would help recover feature parity with the existing Java API

Yeah, looking at the Java api, something like findLatestAncestorOlderThan makes sense.

WDYT?

Sounds good to me!

May 15 '24 17:05 chinmay-bhat

Hi @syun64 @chinmay-bhat. Thank you for driving these features. I saw some PRs raised by @chinmay-bhat : #728, #748, #750, #758. Those are great!

I would like to discuss what APIs we want to expose to users. I am actually leaning toward removing set_ref_snapshot from public API. The reason is that set_ref_snapshot itself requires too many details from users and thus error-prone. It shall sit behind APIs like rollback_to_snapshot, create_tag, cherryPick which have more specific functionality and simpler to use.

We may also want to group these APIs into a class (e.g. ManageSnapshots), like the java implementation. So the resulting user experience with these APIs will be

with table.transaction() as txn:
    ... other updates
    with txn.manage_snapshots() as ms:
       ms.createTag("Tag_A", 0)
       ms.createTag("Tag_B", 1)

We could also keep table.manage_snapshots() for convenience. Similar to the usage of UpdateSchema.

What do you think about this? Appreciate any thoughts and suggestions!

Also kindly looping @Fokko to this thread.

May 22 '24 02:05 HonahX

@HonahX thank you for your response. I agree that we should hide the set_ref_snapshot from the public API. I also like the idea of creating a ManageSnapshots inner class in Transaction to organise the APIs, while exposing them through Transaction and Table.

May 22 '24 09:05 chinmay-bhat