[SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation
Describe the problem you faced
After running an insert to overwrite a Hudi table in place using `insert_overwrite_table`, partitions which no longer exist in the new input data are not removed by the Hive sync. This causes some query engines (e.g. AWS Athena) to fail until the old partitions are manually removed.
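For reference, the manual cleanup is just dropping the stale partition entry from the metastore. A minimal sketch via Spark SQL, assuming an active `SparkSession` named `spark`; the database, table and partition names are hypothetical, and the same `ALTER TABLE ... DROP PARTITION` statement can be run from Athena:

```python
# Hypothetical manual workaround: drop the stale partition entry from the
# Hive/Glue metastore so query engines stop referencing the removed data.
spark.sql("""
    ALTER TABLE my_db.my_hudi_table
    DROP IF EXISTS PARTITION (partition_col='1')
""")
```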
This is on Hudi 0.12.1, but I'm fairly sure the issue still exists in 0.13.0: https://github.com/apache/hudi/pull/6662 fixes this behaviour for `delete_partition` operations but doesn't add any handling for `insert_overwrite_table`.
I'd be happy to be proven wrong if this is fixed in 0.13.0, but I don't have an environment to test it easily without working out how to upgrade Hudi on EMR ahead of a release.
To Reproduce
Steps to reproduce the behavior:
- Create a new Hudi table using input data with two partitions, e.g. partition_col=1, partition_col=2
- Insert into the table using `hoodie.datasource.write.operation=insert_overwrite_table` with input data containing half of the original partitions, e.g. only partition_col=2 (see the sketch below)
- Run Hive sync (neither the Spark writer's inline sync nor the standalone HiveSyncTool drops the partition)
- Check the Hive partitions. Both partitions still exist
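A minimal PySpark sketch of the steps above. The table name, S3 path, schema and Hive sync settings (HMS mode against Glue on EMR) are hypothetical placeholders, not the exact job that hit the issue:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table name and base path.
base_path = "s3://my-bucket/hudi/partition_drop_test"
hudi_opts = {
    "hoodie.table.name": "partition_drop_test",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    # Hive sync settings, assuming HMS sync mode with Glue as the metastore.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": "partition_drop_test",
    "hoodie.datasource.hive_sync.partition_fields": "partition_col",
}

# Step 1: create the table with two partitions (partition_col=1 and 2).
df1 = spark.createDataFrame([(1, 100, 1), (2, 100, 2)], ["id", "ts", "partition_col"])
(df1.write.format("hudi")
    .options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "insert")
    .mode("overwrite")
    .save(base_path))

# Step 2: overwrite the whole table with data covering only partition_col=2.
df2 = spark.createDataFrame([(3, 200, 2)], ["id", "ts", "partition_col"])
(df2.write.format("hudi")
    .options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "insert_overwrite_table")
    .mode("append")
    .save(base_path))

# Step 3: check the synced partitions. partition_col=1 is still listed.
spark.sql("SHOW PARTITIONS default.partition_drop_test").show()
```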
Expected behavior
I'd expect the partition which was not present in the new input data to be removed, i.e. only partition_col=2 exists and partition_col=1 is dropped from Hive.
Environment Description
- Hudi version : 0.12.1
- Spark version : 3.3.1
- Hive version : AWS Glue
- Hadoop version :
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
Additional context
Running on EMR 6.9.0
Thanks for the feedback, guess you are right, this should be supported
Has this problem been solved? @Limess
cc @codope guess this should have been fixed? https://github.com/apache/hudi/pull/6662
Yes this was fixed in 0.13.0
@codope: Hello, this issue still exists in version 0.14, why was it closed?
@codope: As described in this issue, the problem occurs every time; the version we are currently using is 0.14. @Limess: Have you encountered this problem again? May I ask how it was avoided? Thanks!
@zhaobangcai The full context is that the issue was fixed, but in order to fix it the archived timeline was also being read. This caused too high sync latency, hence the fix was reverted. Generally, reading the archived timeline is an anti-pattern in Hudi, and we are optimizing this by implementing an LSM timeline in 1.0.0. That said, I think we did fix the timeline loading in https://github.com/apache/hudi/commit/ab61f61df9686793406300c0018924a119b02855, which I believe is in 0.14. Can you please share a script/test case to reproduce the issue with all the configs that you used in your environment? I am going to reopen the issue based on your comment and debug further once you provide the script/test case. Thanks.
We never pursued this and are still on 0.13.0 for now, so I can't verify either way, sorry!