Commit a parquet-mr patch that enables writing out row-group sizes smaller than 100

Open selitvin opened this issue 7 years ago • 2 comments

The parquet-hadoop library does not support row-group sizes smaller than 100 (PARQUET-409). Until this is resolved by the Parquet project, we should add a patch (or a reference to a pull request) plus build instructions, to make it easier for our users to generate Parquet files with row groups smaller than 100.

selitvin, Aug 23 '18 01:08

Is there a link to a patch (or even better, reference to a pull request) we can have a look at?

ingolfured, Jul 09 '21 09:07

I don't remember the details, since it was a long time ago. See if either of these references helps: https://github.com/apache/parquet-mr/pull/470 and https://issues.apache.org/jira/browse/PARQUET-409

In my experience, it's typically not a good idea to have Parquet stores with small row-groups. Doing so violates a number of assumptions about the Parquet store structure and makes you "fight" the Parquet library implementation a lot. In some scenarios this shows up as poor performance and a large memory footprint.
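
A rough way to see one source of that overhead: a Parquet file stores one column chunk per column in every row group, and the file footer carries metadata for each chunk. The sketch below only counts those structures for a given row-group size. The function name `row_group_layout` and the example numbers are hypothetical, chosen for illustration; it does not measure actual performance or memory use.

```python
def row_group_layout(total_rows: int, rows_per_group: int, num_columns: int) -> dict:
    """Count the row groups and column chunks a Parquet file would contain.

    Hypothetical helper for illustration only. It assumes every row group
    except possibly the last holds exactly `rows_per_group` rows.
    """
    if total_rows < 0 or rows_per_group <= 0 or num_columns <= 0:
        raise ValueError("total_rows must be >= 0; the other arguments must be > 0")

    # Integer ceiling division: a final partial row group still counts as one.
    row_groups = (total_rows + rows_per_group - 1) // rows_per_group

    # Each row group contains one column chunk per column, and each chunk
    # has its own metadata entry in the file footer.
    column_chunks = row_groups * num_columns

    return {"row_groups": row_groups, "column_chunks": column_chunks}


if __name__ == "__main__":
    # Illustrative numbers: 1,000,000 rows and 20 columns.
    for size in (10, 100, 100_000):
        print(size, row_group_layout(1_000_000, size, 20))
```

With these example numbers, shrinking the row group from 100,000 rows to 10 rows multiplies the number of column chunks (and their footer entries) by 10,000, which is one reason readers that assume a modest number of large row groups can struggle.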

selitvin avatar Jul 19 '21 16:07 selitvin