
Add ability to configure how Spark handles dates in parquet files.

Open benedeki opened this issue 2 years ago • 3 comments

Background

With Spark 3, new options were added that control how dates before 1900 are handled in Parquet files. The settings are:

  • `spark.sql.parquet.datetimeRebaseModeInRead`
  • `spark.sql.parquet.datetimeRebaseModeInWrite`
  • `spark.sql.parquet.int96RebaseModeInRead`
  • `spark.sql.parquet.int96RebaseModeInWrite`

Details here.

Feature

Allow setting of the options for Enceladus jobs

### Tasks
- [ ] ~Add command line options to be able to set the **read** options. Set a default behavior either to `EXCEPTION` or `LEGACY`.~
- [ ] ~Modify the helper scripts to recognize these settings~
- [ ] ~Add a `reference.conf`/`application.conf` setting to be applied to write options. The default should be `LEGACY`~
- [ ] Modify the helper scripts so that the Spark settings can easily be passed to `spark-submit` - the defaults remain the same as described above
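
To make the remaining task concrete, here is a minimal sketch of how a helper could turn the chosen modes into `spark-submit` arguments. It is written in Python purely for illustration and is not taken from the Enceladus helper scripts; the function name `rebase_conf_args` and the choice of `LEGACY` as the read default (the issue leaves `EXCEPTION` vs. `LEGACY` open) are assumptions.

```python
# Hypothetical helper: names and structure are illustrative, not Enceladus code.
# EXCEPTION, CORRECTED and LEGACY are the values Spark accepts for these options.
VALID_MODES = {"EXCEPTION", "CORRECTED", "LEGACY"}

READ_KEYS = (
    "spark.sql.parquet.datetimeRebaseModeInRead",
    "spark.sql.parquet.int96RebaseModeInRead",
)
WRITE_KEYS = (
    "spark.sql.parquet.datetimeRebaseModeInWrite",
    "spark.sql.parquet.int96RebaseModeInWrite",
)


def rebase_conf_args(read_mode="LEGACY", write_mode="LEGACY"):
    """Return the spark-submit arguments that set the four rebase options.

    The LEGACY read default is an assumption; the issue leaves the choice
    between EXCEPTION and LEGACY open.
    """
    args = []
    for mode, keys in ((read_mode, READ_KEYS), (write_mode, WRITE_KEYS)):
        mode = mode.upper()
        if mode not in VALID_MODES:
            raise ValueError(f"unsupported rebase mode: {mode}")
        for key in keys:
            args += ["--conf", f"{key}={mode}"]
    return args


if __name__ == "__main__":
    # Print the command line a helper script might assemble.
    print(" ".join(["spark-submit"] + rebase_conf_args(read_mode="EXCEPTION")))
```

Validating the mode up front means a typo surfaces in the helper rather than as a failure inside the Spark job.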

To discuss

  • The command line option names
  • The command line defaults
  • The write configuration names

benedeki avatar Feb 17 '23 16:02 benedeki

This behaviour can be achieved by adding `--conf spark.sql.parquet.datetimeRebaseModeInRead=LEGACY --conf spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY` to the `spark-submit` call in the Spark job JSON file:

`"spark-submit": "spark-submit --num-executors 2 --executor-memory 2G --deploy-mode client --conf spark.sql.parquet.datetimeRebaseModeInRead=LEGACY --conf spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY",`

No code changes in Enceladus are needed. See example usage in json file.
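
As a rough illustration of the approach described in this comment, the sketch below assembles the same `"spark-submit"` entry programmatically. The base command and the two settings are copied from the comment; the function name `spark_submit_entry` and the assumption that the entry is a single key in a JSON object are hypothetical, since the rest of the job file is not shown here.

```python
import json

# Base command copied from the comment; everything else in the job JSON file
# is unknown, so only the "spark-submit" entry is modelled here.
BASE_COMMAND = "spark-submit --num-executors 2 --executor-memory 2G --deploy-mode client"

REBASE_SETTINGS = {
    "spark.sql.parquet.datetimeRebaseModeInRead": "LEGACY",
    "spark.sql.parquet.datetimeRebaseModeInWrite": "LEGACY",
}


def spark_submit_entry(base_command, settings):
    """Append one --conf flag per setting and wrap the result as a JSON-ready dict."""
    conf = " ".join(f"--conf {key}={value}" for key, value in settings.items())
    return {"spark-submit": f"{base_command} {conf}"}


entry = spark_submit_entry(BASE_COMMAND, REBASE_SETTINGS)
print(json.dumps(entry, indent=2))
```

The `int96` variants of the settings could be added to `REBASE_SETTINGS` in the same way if they are needed.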

miroslavpojer avatar Feb 24 '23 09:02 miroslavpojer

Great finding and solution. So only the helper scripts need to be enhanced.

benedeki avatar Mar 01 '23 20:03 benedeki

Yes

miroslavpojer avatar Mar 14 '23 06:03 miroslavpojer