
[SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation

Open Limess opened this issue 2 years ago • 8 comments

Describe the problem you faced

After running an insert to overwrite a Hudi table in place using `insert_overwrite_table`, partitions that no longer exist in the new input data are not removed by the Hive sync. This causes some query engines (e.g. AWS Athena) to fail until the old partitions are removed manually.

This is on Hudi 0.12.1, but I'm fairly sure the issue still exists on 0.13.0: this change, https://github.com/apache/hudi/pull/6662, fixes the behaviour for `delete_partition` operations but doesn't add any handling for `insert_overwrite_table`.
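
For contrast, here's a minimal PySpark sketch of the `delete_partition` write that PR 6662 does handle (the table name, base path, and schema are illustrative assumptions, not taken from our setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi Spark bundle is on the classpath

# delete_partition ignores the rows themselves; an empty frame with the right
# schema is enough. Only the partitions.to.delete config matters here.
empty = spark.createDataFrame([], "id string, partition_col int, ts int")

(empty.write.format("hudi")
 .option("hoodie.table.name", "my_table")  # hypothetical table name
 .option("hoodie.datasource.write.operation", "delete_partition")
 .option("hoodie.datasource.write.partitions.to.delete", "partition_col=1")
 .mode("append")
 .save("s3://my-bucket/hudi/my_table"))  # hypothetical base path
```

After this operation, PR 6662 makes the Hive sync drop `partition_col=1` from the metastore; nothing equivalent happens after `insert_overwrite_table`.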

I'd be happy to be proven wrong if this is fixed in 0.13.0, but I don't have an environment in which to test that easily, short of working out how to upgrade Hudi on EMR before a release ships with it.

To Reproduce

Steps to reproduce the behavior (a script sketch follows the list):

  1. Create a new Hudi table from input data with two partitions, e.g. partition_col=1 and partition_col=2
  2. Insert into the table with `hoodie.datasource.write.operation=insert_overwrite_table`, using input data containing only one of the original partitions, e.g. only partition_col=2
  3. Run a Hive sync (the behaviour is the same whether syncing via the Spark writer or the standalone HiveSyncTool)
  4. Check the Hive partitions: both partitions still exist
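
A minimal PySpark sketch of these steps (a sketch only: the table name, base path, schema, and Hive sync settings are assumptions, not our exact configs):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi Spark bundle is on the classpath

BASE = "s3://my-bucket/hudi/my_table"  # hypothetical base path

hudi_opts = {
    "hoodie.table.name": "my_table",  # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.write.precombine.field": "ts",
    # Hive sync from the Spark writer; on EMR this targets the Glue catalog
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": "my_table",
    "hoodie.datasource.hive_sync.partition_fields": "partition_col",
}

# Step 1: create the table with two partitions
initial = spark.createDataFrame(
    [("a", 1, 1), ("b", 2, 1)], "id string, partition_col int, ts int")
(initial.write.format("hudi").options(**hudi_opts)
 .option("hoodie.datasource.write.operation", "insert")
 .mode("overwrite").save(BASE))

# Step 2: overwrite the whole table with data for only one of the partitions
overwrite = spark.createDataFrame(
    [("c", 2, 2)], "id string, partition_col int, ts int")
(overwrite.write.format("hudi").options(**hudi_opts)
 .option("hoodie.datasource.write.operation", "insert_overwrite_table")
 .mode("append").save(BASE))

# Steps 3-4: the sync ran as part of each write; check what Hive now sees.
# Expected: only partition_col=2. Observed: partition_col=1 is still listed.
spark.sql("SHOW PARTITIONS default.my_table").show()
```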

Expected behavior

I'd expect any partition absent from the new input data to be dropped from Hive: e.g. only partition_col=2 remains, and partition_col=1 is deleted.

Environment Description

  • Hudi version : 0.12.1

  • Spark version : 3.3.1

  • Hive version : AWS Glue

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Running on EMR 6.9.0

Limess avatar Mar 07 '23 16:03 Limess

Thanks for the feedback; I guess you are right, this should be supported.

danny0405 avatar Mar 08 '23 07:03 danny0405

Has this problem been solved? @Limess

donghaihu avatar Apr 29 '24 03:04 donghaihu

cc @codope, I guess this should have been fixed by https://github.com/apache/hudi/pull/6662?

danny0405 avatar Apr 29 '24 05:04 danny0405

Yes, this was fixed in 0.13.0.

codope avatar Apr 29 '24 06:04 codope

@codope: Hello, this issue still exists in version 0.14. Why was it closed?

donghaihu avatar Jun 24 '24 01:06 donghaihu

@codope: As stated in the issue, the problem occurs consistently. The version we are currently using is 0.14. @Limess: Have you encountered this problem again? If not, may I ask how you avoided it? Thanks!

donghaihu avatar Jun 24 '24 02:06 donghaihu

@zhaobangcai The full context is that the issue was fixed, but the fix also required reading the archived timeline, which caused too-high sync latency, so it was reverted. Generally, reading the archived timeline is an anti-pattern in Hudi, and we are optimizing this by implementing the LSM timeline in 1.0.0. That said, I think we did fix the timeline loading in https://github.com/apache/hudi/commit/ab61f61df9686793406300c0018924a119b02855, which I believe is in 0.14. Can you please share a script/test case that reproduces the issue, with all the configs you used in your env? I am going to reopen the issue based on your comment and debug further once you provide the script/test case. Thanks.

codope avatar Jun 24 '24 03:06 codope

> @codope: As stated in the issue, the problem occurs consistently. The version we are currently using is 0.14. @Limess: Have you encountered this problem again? If not, may I ask how you avoided it? Thanks!

We never pursued this and are still on 0.13.0 for now, so I can't verify either way, sorry!

Limess avatar Jun 24 '24 07:06 Limess