[FEATURE] A new Spark SQL command to merge small files

Open gabry-lab opened this issue 1 year ago • 0 comments

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Search before asking

[X] I have searched in the issues and found no similar issues.

Describe the feature

A new Spark SQL command to merge small files

compact table table_name [INTO ${targetFileSize} ${targetFileSizeUnit} ] [ cleanup | retain | list ]
-- targetFileSizeUnit can be 'b', 'k', 'm', 'g', 't' or 'p'
-- cleanup (default): delete the compact staging folders, which contain the original small files
-- retain: keep the compact staging folders, e.g. for testing, so the table can be recovered from the staging data
-- list: only report the planned merge result, without actually running the merge

recover compact table table_name
-- recovers a table after the compact table command fails
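
To make the INTO clause concrete, here is a minimal sketch of how the size and unit could be turned into a byte count. The helper name `target_size_bytes` is hypothetical, and treating the units as binary multiples (1 k = 1024 b) is an assumption; the proposal does not say whether units are binary or decimal.

```python
# Hypothetical helper for the proposed INTO clause.
# ASSUMPTION: units are binary multiples (k = 1024 b, m = 1024 k, ...).
UNIT_EXPONENTS = {"b": 0, "k": 1, "m": 2, "g": 3, "t": 4, "p": 5}


def target_size_bytes(size: int, unit: str) -> int:
    """Convert e.g. (128, 'm') into a number of bytes."""
    key = unit.lower()
    if key not in UNIT_EXPONENTS:
        raise ValueError(f"unknown target file size unit: {unit!r}")
    return size * 1024 ** UNIT_EXPONENTS[key]
```

With this reading, `compact table t INTO 128 m` would target files of `target_size_bytes(128, "m")` bytes.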

Motivation

Many SQL statements generate small files, and these must be merged into larger ones.

Describe the solution

This command does not read and rewrite the records of a table; it merges files at the binary level. For a CSV table, for example, it simply appends the bytes of one file to another, without parsing or re-serializing any records.
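
The binary-level merge described above can be sketched roughly as follows. This is only an illustration on a local filesystem, not the proposed implementation: the function names `plan_groups` and `merge_binary` are hypothetical, the greedy grouping strategy is an assumption, and it assumes headerless CSV files that each end with a newline, so that plain concatenation yields a valid CSV file.

```python
import shutil
from pathlib import Path


def plan_groups(files, target_bytes):
    """Greedily pack files, in the given order, into groups whose total
    size stays within target_bytes. A single file larger than
    target_bytes ends up in a group of its own."""
    groups, current, current_size = [], [], 0
    for path in files:
        size = Path(path).stat().st_size
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


def merge_binary(group, dest):
    """Append the raw bytes of every file in group to dest.
    No records are parsed or re-serialized."""
    with open(dest, "wb") as out:
        for path in group:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Under these assumptions, `plan_groups` alone corresponds to what the `list` option would report, and calling `merge_binary` on each group performs the actual merge.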

Additional context

referring to a blog

Are you willing to submit PR?

[X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
[ ] No. I cannot submit a PR at this time.

Sep 12 '24 12:09 gabry-lab