Ivan issues

Results 16 issues of Ivan

Collapse/show button to hide long text for the note

enhancement

Limit writing row groups based either on size or number of records

This is derived from #116. Currently we do not limit row group size in any way, even though we have options in `WriterProperties` for it, like max row group size....

enhancement
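The limit described in the issue above can be sketched as a buffer that closes a row group once either threshold is reached. This is only an illustration in Python under stated assumptions: `RowGroupBuffer`, `max_rows`, and `max_bytes` are hypothetical names, not part of `WriterProperties` or any Parquet library, and record size is approximated by raw byte length.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroupBuffer:
    """Hypothetical buffer that closes a row group on a row or byte limit."""

    max_rows: int
    max_bytes: int
    rows: list = field(default_factory=list)
    size_bytes: int = 0
    flushed: list = field(default_factory=list)

    def append(self, record: bytes) -> None:
        # Buffer the record, then check both limits.
        self.rows.append(record)
        self.size_bytes += len(record)
        if len(self.rows) >= self.max_rows or self.size_bytes >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        # Close the current row group if it holds any records.
        if self.rows:
            self.flushed.append(self.rows)
            self.rows = []
            self.size_bytes = 0
```

A caller would `append` records as they arrive and call `flush` once at the end to close the last, possibly partial, row group.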

Support writing statistics

This is derived from #116. It would be good to add statistics support to the write path, since currently we do not write statistics. I think it only needs to be...

enhancement

Improve parquet-schema and parquet-read CLI tools

This is a follow-up of #156. CLI tools `parquet-schema` and `parquet-read` could be improved with a better help message and parameter support, since the current version of both tools has...

enhancement

[SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

### What changes were proposed in this pull request? This PR updates schema inference in DSv1 FileFormat to remove overlapping columns from the data schema and keep them in the...

SQL

Handle Dictionary pages and build filters from those

Currently we build statistics without accounting for dictionary pages. We should either have dictionary page statistics without column filters, or, if there is a fallback, have a split of statistics....

Consider cache for query plan

Currently we cache filter statistics and table metadata for each queried table. This issue is about caching the query plan, so when we hit the same plan, we can yield result...

question

Add disk spill for dictionary filter

Currently, when building dictionary filters, we have to keep them in memory. They should spill to disk after a certain threshold.
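One way to picture the spill behaviour requested above is the rough Python sketch below. It is not the project's implementation: `SpillingValueSet` and `max_in_memory` are hypothetical names, values are assumed to be newline-free strings, and a lookup falls back to a linear scan of the spilled files.

```python
import os
import tempfile


class SpillingValueSet:
    """Hypothetical set of dictionary values that spills to disk past a threshold."""

    def __init__(self, max_in_memory: int):
        self.max_in_memory = max_in_memory
        self.memory = set()
        self.spill_paths = []

    def add(self, value: str) -> None:
        self.memory.add(value)
        if len(self.memory) > self.max_in_memory:
            self._spill()

    def _spill(self) -> None:
        # Write the in-memory values to a temporary file, one per line.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            for value in sorted(self.memory):
                out.write(value + "\n")
        self.spill_paths.append(path)
        self.memory.clear()

    def __contains__(self, value: str) -> bool:
        # Check memory first, then scan every spilled file.
        if value in self.memory:
            return True
        for path in self.spill_paths:
            with open(path, encoding="utf-8") as spilled:
                if any(line.rstrip("\n") == value for line in spilled):
                    return True
        return False

    def close(self) -> None:
        # Remove the temporary spill files.
        for path in self.spill_paths:
            os.remove(path)
        self.spill_paths.clear()
```

The linear scan keeps the sketch short. A real filter would more likely keep a compact index, or a Bloom filter, over the spilled values.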

Investigate different implementation of ParquetReader

Currently we are using the Spark Parquet reader. This issue is about investigating whether we can extract data pages and index them, including each page's statistics. During scan we would select...

question

Add performance tests and benchmark

Add performance tests to compare with the Parquet implementation or to compare performance across releases. These should run as part of CI to determine whether there is a performance regression.

enhancement