hudi [HUDI-4773] Adding new filter mode to clustering to filter for recently touched files

Change Logs

Hudi has a partition-aware clustering strategy as well as a recent-partitions-based strategy for clustering. These work well if partitioning is based on dates. But what if partitioning is based on some other, arbitrary field?

So, this patch introduces a clustering filter mode to filter based on recently altered files.

For example, if a user configures clustering to run every 5 commits, then each time clustering runs it will consider only the file groups touched in the last 5 commits. This avoids repeatedly clustering file groups that are already clustered, and clustering will be faster since only the delta file groups are considered.
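The selection described above can be sketched roughly as follows. This is an illustration only, not the code in this patch: the `Commit` class, the `recently_touched_file_groups` helper and the file group ids are hypothetical names made up for the example, and it assumes each commit records the set of file groups it wrote.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for one completed commit on the timeline."""
    instant_time: str                    # sortable commit timestamp
    touched_file_groups: FrozenSet[str]  # ids of file groups written by this commit


def recently_touched_file_groups(timeline: List[Commit], last_n: int) -> Set[str]:
    """Return the ids of file groups written by the newest `last_n` commits."""
    if last_n <= 0:
        return set()
    # Sort by instant time and keep only the newest `last_n` commits.
    newest = sorted(timeline, key=lambda c: c.instant_time)[-last_n:]
    touched: Set[str] = set()
    for commit in newest:
        touched |= commit.touched_file_groups
    return touched


# Seven commits; clustering is assumed to run with a window of 5 commits.
timeline = [
    Commit("001", frozenset({"fg-1"})),
    Commit("002", frozenset({"fg-2"})),
    Commit("003", frozenset({"fg-3"})),
    Commit("004", frozenset({"fg-4", "fg-1"})),
    Commit("005", frozenset({"fg-5"})),
    Commit("006", frozenset({"fg-6"})),
    Commit("007", frozenset({"fg-7"})),
]
candidates = recently_touched_file_groups(timeline, last_n=5)
```

In this toy timeline, `fg-2` was last written outside the 5-commit window, so it is not a clustering candidate, while `fg-1` is picked up again because commit `004` touched it.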

Added a new config, hoodie.clustering.plan.filter.mode, whose possible values are NONE, RECENTLY_UPDATED_FILES and RECENTLY_INSERTED_FILES.

RECENTLY_INSERTED_FILES would also benefit users who only want to sort records by some column using clustering. It may not make sense to re-cluster (or re-sort) a file group that is already clustered/sorted. With this filtering logic, each time clustering is triggered one can select only those file groups that received inserts in the last N commits.
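As a rough illustration of how the new config might be set alongside inline clustering, here is a small sketch that builds a writer options map. The key hoodie.clustering.plan.filter.mode and its three values come from this description; hoodie.clustering.inline and hoodie.clustering.inline.max.commits are existing Hudi options. The `clustering_options` helper is hypothetical, and the idea that the filter window follows the inline clustering commit interval is an assumption drawn from the example above, not something confirmed here.

```python
from typing import Dict

# Values listed in this PR description for hoodie.clustering.plan.filter.mode.
FILTER_MODES = ("NONE", "RECENTLY_UPDATED_FILES", "RECENTLY_INSERTED_FILES")


def clustering_options(filter_mode: str, every_n_commits: int) -> Dict[str, str]:
    """Hypothetical helper: build writer options for inline clustering with a filter mode."""
    if filter_mode not in FILTER_MODES:
        raise ValueError(f"unknown filter mode: {filter_mode!r}")
    if every_n_commits < 1:
        raise ValueError("every_n_commits must be at least 1")
    return {
        "hoodie.clustering.inline": "true",
        # Assumption: the "last N commits" window matches this interval.
        "hoodie.clustering.inline.max.commits": str(every_n_commits),
        "hoodie.clustering.plan.filter.mode": filter_mode,
    }


options = clustering_options("RECENTLY_INSERTED_FILES", every_n_commits=5)
```

A map like this could then be passed as write options; the helper exists only to keep the three keys together and reject a misspelled mode early.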

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level: low/medium

This is a feature or enhancement to clustering which could benefit some users based on their need.

Contributor's checklist

[ ] Read through contributor's guide
[ ] Change Logs and Impact were stated clearly
[ ] Adequate tests were added if applicable
[ ] CI passed

Sep 04 '22 01:09 nsivabalan

CI report:

d3ba36e084b0270252e19932816f6a6acb50fd4e Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

Oct 22 '22 13:10 hudi-bot

A general question: Will this cluster other small files that might be in the partition that was touched that are not part of the current group?

Think of the following example: Timestamped event data arrives from devices once every hour, up to 24 times per day. Data is ingested in a single batch every time the processing is run -> 24 times per day -> 24 commits a day. 1% of the devices are offline for up to 100 days. The storage has daily partitions.

Over the course of 100 days, the 1% of devices create up to 100 * 24 = 2400 file groups in the partition that is 100 days before today.

With this PR merged, will all of those files be clustered?

Nov 27 '22 22:11 HEPBO3AH