
[MINOR] Add parallel listing of existing partitions

Open VitoMakarevich opened this issue 2 years ago • 4 comments

Change Logs

Currently, AWS Glue sync provides two interfaces: HoodieHiveSyncClient, which goes through Hive and then AWS's hidden Hive-to-Glue implementation, and AWSGlueCatalogSyncClient. Both have limitations. Hive has improved somewhat with pushdown filters, but it still fails at large scale, and the fallback may not work for certain partitioning schemes. Syncing to Glue through HoodieHiveSyncClient has the following limitations:

  1. Create/update is not parallelized under the hood, so for big partition sets it is very slow: empirically about 40 partitions/sec at most, which translates to minutes at bigger scale.
  2. The pushdown filter is not really effective. When specifying exact partitions it works unpredictably: the longer the partition values, the fewer partitions can be specified (in our case no more than ~100), so it falls back to the min-max predicate.
  3. The min-max predicate does not help when the number of partitions grows with nesting, e.g. 10 partitions on level 1, 100 on level 2, 1000 on level 3. Min-max cuts down the top level but still loads everything below it, so there is no real optimization.
  4. On a schema change, the Hive-to-Glue path cascades the change to every partition, which makes syncing big tables in meaningful time impossible, even though Hudi does not set a schema at the partition level in Glue, so the effort is wasted.

This is why AWSGlueCatalogSyncClient is preferable, but it has its own problems:

  1. Create/update/delete were not optimized before. They are now asynchronous, but without a sensible upper bound they simply hit the request limit and stay there. This change adds a parameter for that parallelism and the corresponding parallelization logic.
  2. AWSGlueCatalogSyncClient always lists all partitions, which is far from optimal: the goal is only to distinguish which of the partitions changed since the last sync are creates and which are deletes, so a more targeted API, BatchGetPartition, can be used and is easy to parallelize. I added a new method to the sync client classes, moved the Hive pushdown into a Hive-specific class, and implemented the method for the AWS Glue client class. A parameter controlling the parallelism is also added.
  3. Listing all partitions is still needed for the initial sync/resync, but the current implementation is straightforward and suboptimal: it relies on plain nextToken paging, which is sequential and slow on heavily partitioned tables. AWS offers an improvement for this API called segments, which lets us create 1 to 10 starting positions and use the standard nextToken paging within each segment; the latest public version of the Hive-to-Glue interface implementation already uses it. When we switched from the Hive sync class to the AWS Glue-specific one, the first thing we hit was a performance regression in listing. I added usage of the segment API and a parameter controlling the parallelism (see the sketch after this list).
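For illustration, here is a minimal sketch of what segment-based parallel listing can look like with the AWS SDK for Java v2; the class and method names (ParallelPartitionLister, listAllPartitions) are made up for this example and are not the exact code in the PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetPartitionsRequest;
import software.amazon.awssdk.services.glue.model.GetPartitionsResponse;
import software.amazon.awssdk.services.glue.model.Partition;
import software.amazon.awssdk.services.glue.model.Segment;

public class ParallelPartitionLister {

  // Glue allows 1 to 10 segments per table scan; the parallelism value is assumed to come
  // from the new sync config described above.
  public static List<Partition> listAllPartitions(GlueClient glue, String database, String table,
                                                  int parallelism) {
    int totalSegments = Math.max(1, Math.min(parallelism, 10));
    List<CompletableFuture<List<Partition>>> futures = IntStream.range(0, totalSegments)
        .mapToObj(segmentNumber -> CompletableFuture.supplyAsync(
            () -> listSegment(glue, database, table, segmentNumber, totalSegments)))
        .collect(Collectors.toList());
    return futures.stream()
        .flatMap(future -> future.join().stream())
        .collect(Collectors.toList());
  }

  // Each segment is paged independently with the usual nextToken loop.
  private static List<Partition> listSegment(GlueClient glue, String database, String table,
                                             int segmentNumber, int totalSegments) {
    Segment segment = Segment.builder()
        .segmentNumber(segmentNumber)
        .totalSegments(totalSegments)
        .build();
    List<Partition> partitions = new ArrayList<>();
    String nextToken = null;
    do {
      GetPartitionsResponse response = glue.getPartitions(GetPartitionsRequest.builder()
          .databaseName(database)
          .tableName(table)
          .segment(segment)
          .excludeColumnSchema(true) // partition-level column schemas are not needed for sync
          .nextToken(nextToken)
          .build());
      partitions.addAll(response.partitions());
      nextToken = response.nextToken();
    } while (nextToken != null);
    return partitions;
  }
}
```

Only the listing path is sketched here; the parallelized create/update/delete from item 1 follows the same idea of bounding concurrent Glue calls with the new parallelism parameter.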

All this has been tested on a partitioned table with >200k partitions, where I got a speed improvement from 2-3 minutes down to 3 seconds. Let me know if you are interested in the numbers.

Impact

A new config is added, with its default value (parallelism of 5) taken from the AWS Glue-Hive implementation. The sync was sequential before, so I don't expect this to cause failures, but the parallelism can be set to 1 to stay backward-compatible if needed for whatever reason.
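As a usage illustration only, the parallelism would be passed like any other sync property; the config key below is a placeholder, not the actual key name introduced by this PR:

```java
import org.apache.hudi.common.config.TypedProperties;

public class GlueSyncParallelismConfigExample {
  public static TypedProperties syncProps() {
    TypedProperties props = new TypedProperties();
    // Placeholder key name for illustration only; the real key is the one defined by this PR.
    // Default is 5 (taken from the AWS Glue-Hive implementation); setting 1 restores the old
    // sequential behavior.
    props.setProperty("hoodie.datasource.meta.sync.glue.partition_read_parallelism", "5");
    return props;
  }
}
```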

Risk level (write none, low medium or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The website needs to be adjusted; please let me know if you are OK with the changes and I'll proceed.

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

VitoMakarevich avatar Jan 08 '24 17:01 VitoMakarevich

@hudi-bot run azure

VitoMakarevich avatar Jan 09 '24 10:01 VitoMakarevich

@yihua will you review this if I fix conflicts?

VitoMakarevich avatar Feb 07 '24 23:02 VitoMakarevich

@yihua @nsivabalan is there any chance you'll be able to take a look at it? It's a significant improvement and makes the sync much faster... We've been running it in production for a month already with no issues.

VitoMakarevich avatar Feb 15 '24 22:02 VitoMakarevich

@yihua @nsivabalan

VitoMakarevich avatar Mar 02 '24 00:03 VitoMakarevich

@VitoMakarevich Thanks for the contribution. Could you rebase the PR on the latest master and resolve the conflicts?

yihua avatar Mar 07 '24 06:03 yihua

cc @VitoMakarevich for updating and I think it is a valuable contribution.

danny0405 avatar Mar 08 '24 01:03 danny0405

Thanks for the review, will take a look this week.

VitoMakarevich avatar Mar 14 '24 01:03 VitoMakarevich

@yihua I found that the functionality I'm trying to bring in heavily collides with https://github.com/apache/hudi/pull/10572. So for loading the list of partitions, there are now the following mechanisms:

  1. Get all: has always been present; it simply fetches everything and does not scale well with many partitions. In our case each commit could spend 10+ minutes just fetching partitions.
  2. Try to generate a pushdown filter (a fairly recent addition, with more improvements lately): if the filter fits in roughly 2048 characters, it lists all the changed partitions explicitly; otherwise it takes the min/max of the changed partitions and reads within that range. The 2048 limit is error-prone because it depends on partition depth, value length, and so on, and min/max is better than nothing but still suffers when the changed partition values are spread out.
  3. The mechanism I use: simply call batchGetPartition with all changed partitions, which scales almost indefinitely (sketched below).

Mechanisms 1 and 2 are what is in the master branch, but with my approach we don't need them at all: listing everything is only required initially (when creating the Glue database), and after that the incremental path is enough. Since they are already in master, we may need backward compatibility. Can you suggest a course forward? I can put this behavior behind a feature flag, but I would prefer to keep the pushdown path only for HiveSyncTool, where it makes more sense, and use my mechanism in the AWS implementation.
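A rough sketch of mechanism 3 with the AWS SDK for Java v2; the chunk size constant and the class/method names are assumptions for this example, not code taken from the PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.BatchGetPartitionRequest;
import software.amazon.awssdk.services.glue.model.BatchGetPartitionResponse;
import software.amazon.awssdk.services.glue.model.Partition;
import software.amazon.awssdk.services.glue.model.PartitionValueList;

public class ChangedPartitionLookup {

  // BatchGetPartition accepts a bounded number of partitions per call (1000 at the time of
  // writing), so the changed-since-last-sync partitions are looked up chunk by chunk.
  private static final int BATCH_SIZE = 1000;

  public static List<Partition> fetchExisting(GlueClient glue, String database, String table,
                                              List<List<String>> changedPartitionValues) {
    List<Partition> existing = new ArrayList<>();
    for (int i = 0; i < changedPartitionValues.size(); i += BATCH_SIZE) {
      List<PartitionValueList> batch = changedPartitionValues
          .subList(i, Math.min(i + BATCH_SIZE, changedPartitionValues.size()))
          .stream()
          .map(values -> PartitionValueList.builder().values(values).build())
          .collect(Collectors.toList());
      BatchGetPartitionResponse response = glue.batchGetPartition(BatchGetPartitionRequest.builder()
          .databaseName(database)
          .tableName(table)
          .partitionsToGet(batch)
          .build());
      // Real code would also retry response.unprocessedKeys(); omitted here for brevity.
      existing.addAll(response.partitions());
    }
    return existing;
  }
}
```

Changed partitions that come back in the response already exist in Glue and become updates; the ones that do not come back are creates, so the full listing is never needed on the incremental path.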

As for the rest (the improved listing of all partitions and the parallelized update/create/delete), it does not collide with anything. Waiting for your advice; I'm currently stuck because of failing integration tests, but I need to know which direction we take. I don't want to silently delete someone's recently introduced code just because the improvement I bring looks better, but that is probably still the best way for the AWS-specific part.

VitoMakarevich avatar Mar 14 '24 16:03 VitoMakarevich

@parisni can you take a look at this PR and check whether it addresses your concerns? We are running syncs of 2M+ partitions with this code within seconds.

VitoMakarevich avatar Mar 14 '24 17:03 VitoMakarevich

CI report:

  • 3bbb2f1380bf2b06fb69cc1b08e049ca851f8f18 Azure: SUCCESS
Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Mar 15 '24 13:03 hudi-bot

@parisni I will add improvements in the following PR. The list I plan to add:

  1. Integration tests, as you mentioned
  2. .excludeColumnSchema(true): although as I remember, AWS Glue showed no schema on partitions; anyway, it's one line, so I will add it.
  3. Documentation on the website so that the new parameters are covered.

If there is anything besides these, please comment; I will work on it next week.

VitoMakarevich avatar Mar 16 '24 11:03 VitoMakarevich

thanks @VitoMakarevich

although as I remember, AWS Glue showed no schema on partitions; anyway, it's one line, so I will add it.

Indeed, in Glue every partition is associated with its own schema. When one creates a partition, Glue picks up the current table schema and associates it with the partition. Then when you fetch partitions, the default is to also fetch the associated schema, which adds resource overhead AFAIK.
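For reference, with the AWS SDK for Java v2 this is a single flag on the listing request (the class name and arguments below are placeholders for illustration):

```java
import software.amazon.awssdk.services.glue.model.GetPartitionsRequest;

public class ExcludeSchemaExample {
  public static GetPartitionsRequest request(String database, String table) {
    // Skip per-partition column schemas in the response; the sync mainly needs partition
    // values and storage locations, so this trims response size and Glue-side work.
    return GetPartitionsRequest.builder()
        .databaseName(database)
        .tableName(table)
        .excludeColumnSchema(true)
        .build();
  }
}
```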

parisni avatar Mar 16 '24 14:03 parisni

@danny0405 small follow-up just to confirm: config changes will be picked up automatically, right? I opened the asf-site branch and it looks like if I just add a new parameter to an existing class, it should be visible once someone rebuilds the site.

VitoMakarevich avatar Mar 20 '24 22:03 VitoMakarevich

yeah, the website would rebuild in minutes.

danny0405 avatar Mar 21 '24 05:03 danny0405