[HUDI-9504] support in-memory buffer sort in append write
Change Logs
Add an in-memory buffer sort to the append write path to improve the parquet compression ratio. In our experiments and testing, it can improve the compression ratio by 300% with the right sort key and buffer size configuration.
Impact
Users can enable the feature through the buffer sort configuration options.
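For illustration, enabling it might look like the sketch below; the option keys here are placeholders I made up, not necessarily the exact names introduced by this PR:

```java
import java.util.HashMap;
import java.util.Map;

public class BufferSortOptionsExample {
  public static void main(String[] args) {
    // Hypothetical option keys, for illustration only; check the PR for
    // the actual configuration names.
    Map<String, String> tableOptions = new HashMap<>();
    tableOptions.put("write.buffer.sort.enabled", "true");            // assumed key: turn on buffer sort
    tableOptions.put("write.buffer.sort.keys", "user_id,event_time"); // assumed key: sort key columns
    tableOptions.put("write.buffer.sort.size", "268435456");          // assumed key: 256MB flush threshold
    System.out.println(tableOptions);
  }
}
```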
Risk level (write none, low, medium or high below)
low
Documentation Update
This is a new feature. A Jira ticket will be created to update the website.
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
@zhangyue19921010 @danny0405 Updated the diff with BinaryInMemorySortBuffer.
Will finish my review later this week.
@HuangZhenQiu Still working on this?
@zhangyue19921010 Yes. I was OOO last week. Will update the diff this week.
Thanks for the review. @cshuo @zhangyue19921010
Ok, will take another look soon.
@cshuo Thanks for the valuable comments. Resolved all of them except the one about the buffer size option. Shall we keep that option so users have the flexibility to adopt the feature?
Thanks for updating. It seems the PR doesn't address the comment here, i.e., with the current impl, records are only partially ordered within a parquet file, since it may contain batches from multiple `sortAndSend` calls. We should keep all records within a file strictly ordered to fully leverage the advantages of the sorting.
Small files are not good for query performance. But if we keep the whole parquet file ordered, we will lose data freshness. Sort time will increase a lot and then cause high back pressure in the Flink job. Thus, we use the buffer size to control row-group-level ordering and the compression ratio. It is a trade-off between data freshness and storage size that doesn't require keeping the parquet file sorted at the file level. We will leverage table services to do the stitching later.
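To make the trade-off concrete, here is a minimal sketch of the batch-wise sort-and-flush pattern under discussion. This is illustration only, not the PR's actual code: the `Row` type, the size accounting, and the 256MB threshold are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: each buffered batch is sorted before it is handed to
// the parquet writer, so ordering holds within a batch (roughly per row
// group if the buffer size is tuned to match), not across the whole file.
class BufferedSortWriter {
  static final long BUFFER_LIMIT_BYTES = 256L * 1024 * 1024; // assumed flush threshold

  record Row(String sortKey, byte[] payload) {}

  private final List<Row> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  void write(Row row) {
    buffer.add(row);
    bufferedBytes += row.payload().length;
    if (bufferedBytes >= BUFFER_LIMIT_BYTES) {
      sortAndSend();
    }
  }

  // Sort the in-memory batch by the configured key and flush it downstream.
  private void sortAndSend() {
    buffer.sort(Comparator.comparing(Row::sortKey));
    buffer.forEach(this::emit);
    buffer.clear();
    bufferedBytes = 0;
  }

  private void emit(Row row) {
    // In the real writer this would append to the open parquet file.
  }
}
```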
> Small files are not good for query performance.
As mentioned above, we can trigger flushing by buffer memory size and set the size properly to relieve the small-file pressure. Also, the current impl doesn't seem to ensure the data is ordered at the row group level either, since a row group is switched when it reaches the configured size limit, e.g., the current default of 120MB (`HoodieStorageConfig#PARQUET_BLOCK_SIZE`).
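For reference, the row group size mentioned above is controlled by `hoodie.parquet.block.size` (the key behind `HoodieStorageConfig#PARQUET_BLOCK_SIZE`); a minimal sketch of setting it, with the surrounding writer plumbing omitted:

```java
import java.util.Properties;

public class ParquetBlockSizeExample {
  public static void main(String[] args) {
    // Row groups roll over at this byte limit, independent of sort-batch
    // boundaries, so a single row group can span several sorted batches.
    Properties writerProps = new Properties();
    writerProps.setProperty("hoodie.parquet.block.size", String.valueOf(120 * 1024 * 1024));
    System.out.println(writerProps);
  }
}
```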
> But if we keep the whole parquet file ordered, we will lose data freshness.
Actually, data freshness is determined by the checkpoint interval. The writer flushes and commits the written files during the checkpoint; until that point the data remains invisible.
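Since written files only become visible when they are committed at a checkpoint, the knob for freshness is Flink's standard checkpoint interval; a minimal sketch (the 60s value is just an example):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalExample {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Written files are committed at checkpoint time, so this interval
    // bounds how stale the readable data can be.
    env.enableCheckpointing(60_000L); // checkpoint every 60 seconds (example value)
  }
}
```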
> Sort time will increase a lot and then cause high back pressure in the Flink job.
Agreed that keeping the whole file ordered will need more sort time. I'm not sure how significant the impact is; I remember @Alowator ran an ingestion benchmark that included sorting of the binary buffer, and reported that sorting was fast enough not to affect write performance, with a default batch size of 256MB to trigger flushing. Maybe you can double-check that. cc @HuangZhenQiu
cc @danny0405 @zhangyue19921010 for final review.
CI report:
- 409dcb46ecdedd922ecf3fe8c0478b1c28f62ce0 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build