
[SUPPORT] Incremental cleaning never used during insert

Open parisni opened this issue 3 years ago • 0 comments

hudi 0.11.1

I am working on tables with a huge number of partitions (> 100k) that are almost append-only: no updates to past data, and deletes are rare.

Previously I had an issue with cleaning combined with bulk-insert: auto-clean was very slow because it never found a previous cleaning commit and therefore always did a full clean of all partitions.

I am now using the insert operation and expected this issue to go away. However, I see the same behavior: auto-clean always processes every partition in the table.

Moreover, cleaning is far slower with metadata enabled (5 minutes without metadata versus 4 hours with it), and it gets slower still when metadata compaction has not run recently. As a result, auto-clean is not usable in my case with metadata enabled.
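For reference, the setup described above could be sketched with the writer options below. This is an illustration rather than the actual job configuration: the helper `hudi_insert_options`, the table name `events`, and the retained-commit count are hypothetical, and the option keys are the standard Hudi cleaner and metadata settings as I understand them for 0.11.x.

```python
def hudi_insert_options(table_name: str, metadata_enabled: bool,
                        commits_retained: int = 10) -> dict:
    """Build Hudi writer options for an insert with inline auto-clean.

    Hypothetical helper: the values are placeholders, not the reporter's settings.
    """
    return {
        "hoodie.table.name": table_name,
        # Plain insert rather than bulk_insert, as described in the issue.
        "hoodie.datasource.write.operation": "insert",
        # Run the cleaner automatically after each commit.
        "hoodie.clean.automatic": "true",
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": str(commits_retained),
        # Incremental mode is meant to restrict cleaning to partitions
        # touched since the last clean instead of scanning all of them.
        "hoodie.cleaner.incremental.mode": "true",
        # Toggle the metadata table to compare cleaning times.
        "hoodie.metadata.enable": str(metadata_enabled).lower(),
    }


options = hudi_insert_options("events", metadata_enabled=False)
```

With PySpark, such a dictionary would typically be passed as `df.write.format("hudi").options(**options).mode("append").save(path)`; switching `metadata_enabled` between runs is one way to reproduce the timing difference reported above.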

By the way, cleaning has several functions, such as removing old files but also repairing the timeline (e.g. timed-out commits).

  1. Is incremental cleaning supposed to work that way?
  2. Can the performance of full cleaning with metadata enabled be improved somehow (for example by using file listing, which is faster)?

parisni, Aug 11 '22 16:08