Commit a parquet-mr patch that enables writing out row-group sizes smaller than 100
The parquet-hadoop library does not support row-group sizes less than 100 (PARQUET-409).
Until this is resolved by the Parquet project, we should add a patch (or a reference to a pull request) plus build instructions, to make it easier for our users to generate Parquet files with row groups smaller than 100.
Is there a link to a patch (or, even better, a reference to a pull request) that we can have a look at?
I don't remember the details, since it was a long time ago. See if either of these references helps: https://github.com/apache/parquet-mr/pull/470 https://issues.apache.org/jira/browse/PARQUET-409
In my experience, it's typically not a good idea to have Parquet stores with small row groups. Doing so violates a number of assumptions about the Parquet store structure and makes you "fight" the Parquet library implementation a lot. It manifests as poor performance and large memory footprints in some scenarios.
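To give a rough sense of scale, here is a minimal sketch of how the number of row groups grows as they get smaller. It assumes the row-group size is counted in rows; `row_group_count` and the row totals are hypothetical names and numbers chosen for illustration, not part of any Parquet API.

```python
def row_group_count(total_rows: int, rows_per_group: int) -> int:
    """Number of row groups needed to hold total_rows (hypothetical helper)."""
    if total_rows < 0 or rows_per_group <= 0:
        raise ValueError("total_rows must be >= 0 and rows_per_group must be > 0")
    # Integer ceiling division: the last row group may be partially filled.
    return (total_rows + rows_per_group - 1) // rows_per_group


if __name__ == "__main__":
    total_rows = 1_000_000  # illustrative dataset size, not a measurement
    for rows_per_group in (10, 100, 100_000):
        groups = row_group_count(total_rows, rows_per_group)
        print(f"{rows_per_group:>7} rows per group -> {groups:>7} row groups")
```

Each row group carries its own metadata entry in the file footer and its own column chunks, so a store with very many tiny row groups spends proportionally more of its size and read time on bookkeeping than on data.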