[HUDI-9504] support in-memory buffer sort in append write
Change Logs
Add an in-memory buffer sort to the append write path to improve the parquet compression ratio. In our experiments and testing, it can improve the compression ratio by 300% with the right sort key and buffer size configuration.
Impact
Users can enable the feature through the buffer sort configuration options.
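For illustration, enabling it might look like the sketch below; the option keys here are placeholders I made up, not necessarily the exact names introduced by this PR:

```java
import java.util.HashMap;
import java.util.Map;

public class BufferSortOptionsExample {
  public static void main(String[] args) {
    // Hypothetical option keys, for illustration only; check the PR for
    // the actual configuration names.
    Map<String, String> tableOptions = new HashMap<>();
    tableOptions.put("write.buffer.sort.enabled", "true");            // assumed key: turn on buffer sort
    tableOptions.put("write.buffer.sort.keys", "user_id,event_time"); // assumed key: sort key columns
    tableOptions.put("write.buffer.sort.size", "268435456");          // assumed key: 256MB flush threshold
    System.out.println(tableOptions);
  }
}
```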
Risk level (write none, low, medium or high below)
low
Documentation Update
This is a new feature. A Jira ticket will be created to update the website.
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
@zhangyue19921010 @danny0405 Updated the diff with BinaryInMemorySortBuffer.
Will finish my review later this week.
@HuangZhenQiu Still working on this?
@zhangyue19921010 Yes. I was OOO last week. Will update the diff this week.
Thanks for the review. @cshuo @zhangyue19921010
Ok, will take another look soon.
@cshuo Thanks for the valuable comments. Resolved all of them except the one about the buffer size option. Shall we keep that option so users have the flexibility to adopt the feature?
Thanks for updating. It seems the PR doesn't address the comment here, i.e., with the current impl, records are only partially ordered within a parquet file, since it may contain batches from multiple `sortAndSend` calls. We should keep all records within a file strictly ordered to fully leverage the advantages of the sorting.
Small files are not good for query performance. But if we keep the whole parquet file ordered, we will lose data freshness. Sort time will increase a lot and then cause high back pressure in the Flink job. Thus, we use the buffer size to control row-group-level ordering and the compression ratio. It is a trade-off between data freshness and storage size that doesn't require keeping the parquet file sorted at the file level. We will leverage table services to do the stitching later.
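To make the trade-off concrete, here is a minimal sketch of the batch-wise sort-and-flush pattern under discussion. This is illustration only, not the PR's actual code: the `Row` type, the size accounting, and the 256MB threshold are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: each buffered batch is sorted before it is handed to
// the parquet writer, so ordering holds within a batch (roughly per row
// group if the buffer size is tuned to match), not across the whole file.
class BufferedSortWriter {
  static final long BUFFER_LIMIT_BYTES = 256L * 1024 * 1024; // assumed flush threshold

  record Row(String sortKey, byte[] payload) {}

  private final List<Row> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  void write(Row row) {
    buffer.add(row);
    bufferedBytes += row.payload().length;
    if (bufferedBytes >= BUFFER_LIMIT_BYTES) {
      sortAndSend();
    }
  }

  // Sort the in-memory batch by the configured key and flush it downstream.
  private void sortAndSend() {
    buffer.sort(Comparator.comparing(Row::sortKey));
    buffer.forEach(this::emit);
    buffer.clear();
    bufferedBytes = 0;
  }

  private void emit(Row row) {
    // In the real writer this would append to the open parquet file.
  }
}
```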
> Small files are not good for query performance.
As mentioned above, we can trigger flushing by buffer memory size and set the size properly to relieve the small-file pressure. Also, the current impl doesn't seem to ensure the data is ordered at the row group level either, since a row group is switched when it reaches the configured size limit, e.g., the current default of 120MB (`HoodieStorageConfig#PARQUET_BLOCK_SIZE`).
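For reference, the row group size mentioned above is controlled by `hoodie.parquet.block.size` (the key behind `HoodieStorageConfig#PARQUET_BLOCK_SIZE`); a minimal sketch of setting it, with the surrounding writer plumbing omitted:

```java
import java.util.Properties;

public class ParquetBlockSizeExample {
  public static void main(String[] args) {
    // Row groups roll over at this byte limit, independent of sort-batch
    // boundaries, so a single row group can span several sorted batches.
    Properties writerProps = new Properties();
    writerProps.setProperty("hoodie.parquet.block.size", String.valueOf(120 * 1024 * 1024));
    System.out.println(writerProps);
  }
}
```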
> But if we keep the whole parquet file ordered, we will lose data freshness.
Actually, data freshness is determined by the checkpoint interval. The writer flushes and commits the written files during the checkpoint; until that point the data remains invisible.
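Since written files only become visible when they are committed at a checkpoint, the knob for freshness is Flink's standard checkpoint interval; a minimal sketch (the 60s value is just an example):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalExample {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Written files are committed at checkpoint time, so this interval
    // bounds how stale the readable data can be.
    env.enableCheckpointing(60_000L); // checkpoint every 60 seconds (example value)
  }
}
```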
> Sort time will increase a lot and then cause high back pressure in the Flink job.
Agreed that keeping the whole file ordered will need more sort time. I'm not sure how significant the impact is; I remember @Alowator ran an ingestion benchmark that included sorting of the binary buffer, and reported that sorting was fast enough not to affect write performance, with a default batch size of 256MB to trigger flushing. Maybe you can double-check that. cc @HuangZhenQiu
cc @danny0405 @zhangyue19921010 for final review.
CI report:
- 409dcb46ecdedd922ecf3fe8c0478b1c28f62ce0 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build