data-validator
Implement Column Statistics / Data Profiling for Numeric Columns
As discussed in our original Spark Summit presentation: see the 22-minute mark.
Listening to myself is awful btw.
Inspired by the nice visualization provided by Facets Overview, while leveraging Spark to handle large, distributed data sets.
Initial considerations involve calculating the statistics as efficiently as possible. Some different approaches off the top of my head include:
- calculating statistics using UDAFs or Spark's built-in aggregate functions. Exact calculation of the histogram and standard deviation would appear to require at least two passes over the data (see the first sketch after this list).
- leveraging existing Hive/SQL functions
- exploring/using approximate methods for histograms, std dev, etc. on large data (see the second sketch below)
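
A minimal sketch of the exact, two-pass route using only built-in Spark SQL aggregates. Everything here is illustrative rather than the data-validator API: the `DataFrame` `df`, the column-name parameter, and the bucket count are assumptions. The first pass collects count/min/max/mean/stddev in one aggregation; the second pass is needed because exact histogram buckets depend on the min/max found in pass one.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, max, mean, min, stddev}

// Hypothetical helper, not part of data-validator: profiles one numeric column.
def exactNumericProfile(df: DataFrame, c: String, numBuckets: Int = 10): Unit = {
  // Pass 1: count, min, max, mean, stddev in a single aggregation over the column.
  val stats = df.agg(
    count(col(c)),
    min(col(c)).cast("double"),
    max(col(c)).cast("double"),
    mean(col(c)),
    stddev(col(c))
  ).head()

  val lo = stats.getDouble(1)
  val hi = stats.getDouble(2)

  // Pass 2: exact histogram counts over evenly spaced buckets derived from the
  // min/max of pass 1 (assumes hi > lo and a non-empty column).
  val buckets = (0 to numBuckets).map(i => lo + i * (hi - lo) / numBuckets).toArray
  val counts = df.select(col(c).cast("double")).na.drop()
    .rdd.map(_.getDouble(0))
    .histogram(buckets)

  println(s"count=${stats.getLong(0)} min=$lo max=$hi " +
    s"mean=${stats.getDouble(3)} stddev=${stats.getDouble(4)}")
  println(s"histogram: ${buckets.zip(counts).mkString(", ")}")
}
```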
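For the approximate route, a hedged sketch along the same lines: `approxQuantile` on `DataFrameStatFunctions` (Greenwald-Khanna, Spark 2.0+) can give bucket edges in a single pass with a tunable relative error, and `approx_count_distinct` (HyperLogLog++) covers cardinality. Again, the function and column names here are assumptions for illustration only.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Hypothetical helper: approximate profile of one numeric column.
def approxNumericProfile(df: DataFrame, c: String, numBuckets: Int = 10): Unit = {
  // Quantiles at evenly spaced probabilities approximate histogram bucket edges;
  // relativeError (0.01 here) trades accuracy for memory and speed.
  val probs = (0 to numBuckets).map(_.toDouble / numBuckets).toArray
  val edges = df.stat.approxQuantile(c, probs, 0.01)

  // Approximate distinct count via HyperLogLog++.
  val distinct = df.agg(approx_count_distinct(col(c))).head().getLong(0)

  println(s"approx quantile edges: ${edges.mkString(", ")}")
  println(s"approx distinct count: $distinct")
}
```

The appeal of this option is that everything fits in a single pass with bounded memory per column, at the cost of approximate bucket boundaries and counts.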