
[SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation

Open Limess opened this issue 2 years ago • 8 comments

Describe the problem you faced

After running an insert to overwrite a Hudi table in place using `insert_overwrite_table`, partitions that no longer exist in the new input data are not removed by the Hive sync. This causes some query engines (e.g. AWS Athena) to fail until the old partitions are removed manually.

This is on Hudi 0.12.1, but I'm fairly sure the issue still exists on 0.13.0: this change, https://github.com/apache/hudi/pull/6662, fixes the behaviour for `delete_partition` operations but doesn't add any handling for `insert_overwrite_table`.
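
For contrast, here's a minimal PySpark sketch of the `delete_partition` write that PR 6662 does handle (the table name, base path, and schema are illustrative assumptions, not taken from our setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi Spark bundle is on the classpath

# delete_partition ignores the rows themselves; an empty frame with the right
# schema is enough. Only the partitions.to.delete config matters here.
empty = spark.createDataFrame([], "id string, partition_col int, ts int")

(empty.write.format("hudi")
 .option("hoodie.table.name", "my_table")  # hypothetical table name
 .option("hoodie.datasource.write.operation", "delete_partition")
 .option("hoodie.datasource.write.partitions.to.delete", "partition_col=1")
 .mode("append")
 .save("s3://my-bucket/hudi/my_table"))  # hypothetical base path
```

After this operation, PR 6662 makes the Hive sync drop `partition_col=1` from the metastore; nothing equivalent happens after `insert_overwrite_table`.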

I'd be happy to be proven wrong if this is fixed in 0.13.0, but I don't have an environment in which to test that easily, short of working out how to upgrade Hudi on EMR before a release ships with it.

To Reproduce

Steps to reproduce the behavior (a script sketch follows the list):

  1. Create a new Hudi table from input data with two partitions, e.g. partition_col=1 and partition_col=2
  2. Insert into the table with `hoodie.datasource.write.operation=insert_overwrite_table`, using input data containing only one of the original partitions, e.g. only partition_col=2
  3. Run a Hive sync (the behaviour is the same whether syncing via the Spark writer or the standalone HiveSyncTool)
  4. Check the Hive partitions: both partitions still exist
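
A minimal PySpark sketch of these steps (a sketch only: the table name, base path, schema, and Hive sync settings are assumptions, not our exact configs):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi Spark bundle is on the classpath

BASE = "s3://my-bucket/hudi/my_table"  # hypothetical base path

hudi_opts = {
    "hoodie.table.name": "my_table",  # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.write.precombine.field": "ts",
    # Hive sync from the Spark writer; on EMR this targets the Glue catalog
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": "my_table",
    "hoodie.datasource.hive_sync.partition_fields": "partition_col",
}

# Step 1: create the table with two partitions
initial = spark.createDataFrame(
    [("a", 1, 1), ("b", 2, 1)], "id string, partition_col int, ts int")
(initial.write.format("hudi").options(**hudi_opts)
 .option("hoodie.datasource.write.operation", "insert")
 .mode("overwrite").save(BASE))

# Step 2: overwrite the whole table with data for only one of the partitions
overwrite = spark.createDataFrame(
    [("c", 2, 2)], "id string, partition_col int, ts int")
(overwrite.write.format("hudi").options(**hudi_opts)
 .option("hoodie.datasource.write.operation", "insert_overwrite_table")
 .mode("append").save(BASE))

# Steps 3-4: the sync ran as part of each write; check what Hive now sees.
# Expected: only partition_col=2. Observed: partition_col=1 is still listed.
spark.sql("SHOW PARTITIONS default.my_table").show()
```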

Expected behavior

I'd expect any partition absent from the new input data to be dropped from Hive: e.g. only partition_col=2 remains, and partition_col=1 is deleted.

Environment Description

  • Hudi version : 0.12.1

  • Spark version : 3.3.1

  • Hive version : AWS Glue

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Running on EMR 6.9.0

Limess avatar Mar 07 '23 16:03 Limess

Thanks for the feedback; I guess you are right, this should be supported.

danny0405 avatar Mar 08 '23 07:03 danny0405

Has this problem been solved? @Limess

donghaihu avatar Apr 29 '24 03:04 donghaihu

cc @codope, I guess this should have been fixed by https://github.com/apache/hudi/pull/6662?

danny0405 avatar Apr 29 '24 05:04 danny0405

Yes, this was fixed in 0.13.0.

codope avatar Apr 29 '24 06:04 codope

@codope: Hello, this issue still exists in version 0.14. Why was it closed?

donghaihu avatar Jun 24 '24 01:06 donghaihu

@codope: As stated in the issue, the problem occurs consistently. The version we are currently using is 0.14. @Limess: Have you encountered this problem again? If not, may I ask how you avoided it? Thanks!

donghaihu avatar Jun 24 '24 02:06 donghaihu

@zhaobangcai The full context is that the issue was fixed, but the fix also required reading the archived timeline, which caused too-high sync latency, so it was reverted. Generally, reading the archived timeline is an anti-pattern in Hudi, and we are optimizing this by implementing the LSM timeline in 1.0.0. That said, I think we did fix the timeline loading in https://github.com/apache/hudi/commit/ab61f61df9686793406300c0018924a119b02855, which I believe is in 0.14. Can you please share a script/test case that reproduces the issue, with all the configs you used in your env? I am going to reopen the issue based on your comment and debug further once you provide the script/test case. Thanks.

codope avatar Jun 24 '24 03:06 codope

> @codope: As stated in the issue, the problem occurs consistently. The version we are currently using is 0.14. @Limess: Have you encountered this problem again? If not, may I ask how you avoided it? Thanks!

We never pursued this and are still on 0.13.0 for now, so I can't verify either way, sorry!

Limess avatar Jun 24 '24 07:06 Limess