Alias and Doc support in Spark Schema
Preserves avro doc and aliases while save Include avro alias column in DataFrame schema
Can someone review this? Please give some suggestion.
I love the idea! And it seems to be working just fine.
A few suggestions...
-
Let's ask @marmbrus if "_doc" and "_aliases" are a good naming convention. It would be nice to make it consistent with other formats supporting it.
-
Is the implicit addAvroAliasColumns a good choice? It would be on every DataFrame - even after modifying columns and wiping out their metadata. Would it make sense to make it a property on SQLContext the same way as parquet settings are done currently?
-
It would be nice to add an assert to the place where files are read after saving. Check that doc and aliases made the loop to disk and back.
-
Would it not be enough and more readable to add the doc and alias on one field in the test only? It does not have to be tested on all the different fields, right? It would save some 100 lines of code in tests ;-)
Thanks for working on this.
- I'd consider making the addition of the alias columns just an option to the datasource, so that it is usable in languages other than scala.
- Other users of metadata don't use
_so i'd just go withdocandaliases
@jendap Implemented review comments Suggestion reply: 2. I added to dataframe since dataframe know the schema and can store metadata.Where aliases/doc are belong to a column data. I think parquet properties are related to deserialization, data type handling.
@JoshRosen Implemented review comments @marmbrus adding alias column is optional where alias columns added after invoking addAvroAliasColumns()
Build is failing due to some other reason. Kindly check
@jendap Implemented your review comments. Please review.
Current coverage is 93.79%
Merging #51 into master will increase coverage by +0.39% as of
ccc433e
@@ master #51 diff @@
======================================
Files 6 6
Stmts 273 306 +33
Branches 45 57 +12
Methods 0 0
======================================
+ Hit 255 287 +32
Partial 0 0
- Missed 18 19 +1
Review entire Coverage Diff as of
ccc433ePowered by Codecov. Updated on successful CI builds.
@marmbrus adding alias column is optional where alias columns added after invoking addAvroAliasColumns()
I realize its optional, but implementing this feature as a scala function means that users in Python and R will not be able to use it. I think instead this should just be a parameter that is passed in the OPTIONS clause or as an option(...) so that it works across languages.
@marmbrus Implemented review comments. Added a option. Please review
Thanks again for working on this! I did another review pass.
@marmbrus implemented review comments. Except the usage of "withAliasesFields" of in buildScan(...) of AvroRelation.scala, I was getting: object not serializable (class: org.apache.avro.Schema$RecordSchema,
What happened with this PR? Are you planning include this in future releases? It would be useful support aliases.