[FEA] Cudf parquet binary write/read changing to uint8 lists

Open hyperbolic2346 opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? Please describe. Cudf is changing the return type of a parquet binary read from list<int8> to list<uint8> and also requiring list<uint8> for writing. This change came about from this issue and is implemented in this PR.
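To illustrate what the element type change means for the data, here is a minimal sketch in plain Python. It assumes the change is from signed to unsigned 8-bit list elements (the title names uint8 lists). It does not use the cudf or Spark Rapids APIs, and the helper names are hypothetical. The point is that the underlying bytes are identical in both views; only the interpretation of bytes at or above 0x80 differs.

```python
import struct

def as_int8_list(data):
    """Hypothetical helper: view each byte as a signed 8-bit value (-128..127)."""
    return list(struct.unpack(f"{len(data)}b", data))

def as_uint8_list(data):
    """Hypothetical helper: view each byte as an unsigned 8-bit value (0..255)."""
    return list(data)

def int8_to_uint8(values):
    """Reinterpret signed 8-bit values as unsigned without changing the bits."""
    return [v & 0xFF for v in values]

# A binary payload containing bytes >= 0x80, where the two views disagree.
payload = "é".encode("utf-8")       # b'\xc3\xa9'
signed = as_int8_list(payload)      # [-61, -87]
unsigned = as_uint8_list(payload)   # [195, 169]

# Converting between the views is lossless, so the original bytes round-trip.
roundtrip = bytes(int8_to_uint8(signed))
```

Code that only passes the bytes through should be unaffected by the change, while code that inspects element values or checks the child column type would need to expect unsigned values.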

Describe the solution you'd like The Spark Rapids plugin needs to support this change. The transition also needs careful handling to prevent breakage when the cudf PR is merged. It would be best to change Spark Rapids before this cudf change so nothing breaks, but this could involve multiple changes or more complex changes. Cudf will always write the type that is read, so it is possible no major changes are necessary here.

Describe alternatives you've considered Another option is to allow the cudf PR to merge and then deal with breakage, but this seems reactive and incorrect. Spark Rapids could also push back and request this change not merge into cudf.

hyperbolic2346, Aug 24 '22 18:08

Will be resolved by https://github.com/rapidsai/cudf/pull/11539. Since BinaryType is used as an intermediate state, read, write, and cast are the primary ways to get to and from the data type. Making the string to byte list cast consistent resolved all inconsistencies found in integration tests and in my separate testing.

rwlee, Oct 12 '22 20:10