HIVE-28047: Iceberg: Major QB Compaction with a single commit
What changes were proposed in this pull request?
This PR changes Hive Iceberg query-based (QB) major compaction to perform the compaction in a single commit instead of the two commits used until now.
Why are the changes needed?
The existing compaction implementation creates two commits, and therefore two snapshots: the first with all files deleted and the second with the compacted files. If a user queries the table by the snapshot ID of the first snapshot, the result is invalid because that snapshot contains no data. This PR avoids the problem by compacting in a single commit.
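To make the difference concrete, here is a minimal sketch of the two commit patterns on a toy snapshot model. It is not the Hive or Iceberg API: `ToyTable`, `compact_two_commits`, `compact_single_commit`, `demo`, and the file names are hypothetical and exist only for illustration. The assumption is simply that each commit produces one snapshot and that a time-travel read returns the files visible in that snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a snapshot-based table (hypothetical, not the Iceberg API)."""
    # Each snapshot is the frozen set of data files visible at that commit.
    snapshots: list = field(default_factory=list)

    def current_files(self):
        return self.snapshots[-1] if self.snapshots else frozenset()

    def commit(self, removed=(), added=()):
        """Apply removals and additions in one commit; return the new snapshot id."""
        files = (self.current_files() - frozenset(removed)) | frozenset(added)
        self.snapshots.append(files)
        return len(self.snapshots) - 1

    def read(self, snapshot_id):
        """Time-travel read: the files visible as of the given snapshot."""
        return self.snapshots[snapshot_id]


def compact_two_commits(table, compacted):
    """Old pattern: delete everything, then add the compacted files."""
    old = table.current_files()
    first = table.commit(removed=old)       # snapshot with every file deleted
    second = table.commit(added=compacted)  # snapshot with the compacted files
    return [first, second]


def compact_single_commit(table, compacted):
    """New pattern: swap old files for compacted files in one commit."""
    old = table.current_files()
    return [table.commit(removed=old, added=compacted)]


def demo():
    """Run both patterns and list the files visible in each snapshot they create."""
    results = {}
    for name, compact in (("two", compact_two_commits),
                          ("single", compact_single_commit)):
        table = ToyTable()
        table.commit(added=["a.parquet", "b.parquet"])
        ids = compact(table, ["compacted.parquet"])
        results[name] = [sorted(table.read(i)) for i in ids]
    return results
```

In this model, the two-commit pattern leaves an intermediate snapshot with no files, so a read pinned to it returns nothing, while the single-commit pattern only ever exposes a snapshot that contains the compacted files.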
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Hive contains 4 q-tests covering Hive Iceberg QB Major Compaction. The expected outputs of these q-tests were updated as part of this PR.
In my opinion this approach looks more expensive. Maybe we should reach out to the Iceberg community with a proposal to extend the current API with atomic IOW semantics?
I agree this approach is a little more expensive, but it does not have the correctness issue of the existing approach. It is also used in other engines such as Amoro and Trino, and Impala is in the process of switching to it.
Quality Gate passed: 0 new issues, 0 security hotspots, no data about coverage, no data about duplication.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.