spark-avro icon indicating copy to clipboard operation
spark-avro copied to clipboard

Alias and Doc support in Spark Schema

Open vasanthrj opened this issue 10 years ago • 11 comments

Preserves avro doc and aliases while save Include avro alias column in DataFrame schema

vasanthrj avatar May 30 '15 19:05 vasanthrj

Can someone review this? Please give some suggestion.

vasanthrj avatar Jun 10 '15 10:06 vasanthrj

I love the idea! And it seems to be working just fine.

A few suggestions...

  1. Let's ask @marmbrus if "_doc" and "_aliases" are a good naming convention. It would be nice to make it consistent with other formats supporting it.

  2. Is the implicit addAvroAliasColumns a good choice? It would be on every DataFrame - even after modifying columns and wiping out their metadata. Would it make sense to make it a property on SQLContext the same way as parquet settings are done currently?

  3. It would be nice to add an assert to the place where files are read after saving. Check that doc and aliases made the loop to disk and back.

  4. Would it not be enough and more readable to add the doc and alias on one field in the test only? It does not have to be tested on all the different fields, right? It would save some 100 lines of code in tests ;-)

jendap avatar Jul 19 '15 20:07 jendap

Thanks for working on this.

  • I'd consider making the addition of the alias columns just an option to the datasource, so that it is usable in languages other than scala.
  • Other users of metadata don't use _ so i'd just go with doc and aliases

marmbrus avatar Jul 20 '15 19:07 marmbrus

@jendap Implemented review comments Suggestion reply: 2. I added to dataframe since dataframe know the schema and can store metadata.Where aliases/doc are belong to a column data. I think parquet properties are related to deserialization, data type handling.

@JoshRosen Implemented review comments @marmbrus adding alias column is optional where alias columns added after invoking addAvroAliasColumns()

Build is failing due to some other reason. Kindly check

vasanthrj avatar Jul 22 '15 19:07 vasanthrj

@jendap Implemented your review comments. Please review.

vasanthrj avatar Jul 26 '15 04:07 vasanthrj

Current coverage is 93.79%

Merging #51 into master will increase coverage by +0.39% as of ccc433e

@@            master     #51   diff @@
======================================
  Files            6       6       
  Stmts          273     306    +33
  Branches        45      57    +12
  Methods          0       0       
======================================
+ Hit            255     287    +32
  Partial          0       0       
- Missed          18      19     +1

Review entire Coverage Diff as of ccc433e

Powered by Codecov. Updated on successful CI builds.

codecov-io avatar Jul 26 '15 04:07 codecov-io

@marmbrus adding alias column is optional where alias columns added after invoking addAvroAliasColumns()

I realize its optional, but implementing this feature as a scala function means that users in Python and R will not be able to use it. I think instead this should just be a parameter that is passed in the OPTIONS clause or as an option(...) so that it works across languages.

marmbrus avatar Jul 27 '15 19:07 marmbrus

@marmbrus Implemented review comments. Added a option. Please review

vasanthrj avatar Aug 16 '15 14:08 vasanthrj

Thanks again for working on this! I did another review pass.

marmbrus avatar Aug 25 '15 19:08 marmbrus

@marmbrus implemented review comments. Except the usage of "withAliasesFields" of in buildScan(...) of AvroRelation.scala, I was getting: object not serializable (class: org.apache.avro.Schema$RecordSchema,

vasanthrj avatar Sep 09 '15 20:09 vasanthrj

What happened with this PR? Are you planning include this in future releases? It would be useful support aliases.

gasparms avatar Nov 22 '17 14:11 gasparms