data-validator
Implement Column Statistics / Data Profiling for Numeric Columns
As discussed in our original Spark Summit presentation: see the 22-minute mark.
Listening to myself is awful btw.
Inspired by the nice visualization provided by Facets Overview, while leveraging Spark to handle large, distributed data sets.
Initial considerations involve calculating the statistics as efficiently as possible. Some different approaches off the top of my head include:
- calculating statistics using UDAFs or Spark's built-in aggregate functions. Exact calculation of the histogram and standard deviation would appear to require at least two passes over the data (see the first sketch after this list).
- leveraging existing Hive/SQL functions
- exploring/using approximate methods for histograms, std dev, etc. on large data (see the second sketch below)
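
A minimal sketch of the exact, two-pass route using only built-in Spark SQL aggregates. Everything here is illustrative rather than the data-validator API: the `DataFrame` `df`, the column-name parameter, and the bucket count are assumptions. The first pass collects count/min/max/mean/stddev in one aggregation; the second pass is needed because exact histogram buckets depend on the min/max found in pass one.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, max, mean, min, stddev}

// Hypothetical helper, not part of data-validator: profiles one numeric column.
def exactNumericProfile(df: DataFrame, c: String, numBuckets: Int = 10): Unit = {
  // Pass 1: count, min, max, mean, stddev in a single aggregation over the column.
  val stats = df.agg(
    count(col(c)),
    min(col(c)).cast("double"),
    max(col(c)).cast("double"),
    mean(col(c)),
    stddev(col(c))
  ).head()

  val lo = stats.getDouble(1)
  val hi = stats.getDouble(2)

  // Pass 2: exact histogram counts over evenly spaced buckets derived from the
  // min/max of pass 1 (assumes hi > lo and a non-empty column).
  val buckets = (0 to numBuckets).map(i => lo + i * (hi - lo) / numBuckets).toArray
  val counts = df.select(col(c).cast("double")).na.drop()
    .rdd.map(_.getDouble(0))
    .histogram(buckets)

  println(s"count=${stats.getLong(0)} min=$lo max=$hi " +
    s"mean=${stats.getDouble(3)} stddev=${stats.getDouble(4)}")
  println(s"histogram: ${buckets.zip(counts).mkString(", ")}")
}
```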
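For the approximate route, a hedged sketch along the same lines: `approxQuantile` on `DataFrameStatFunctions` (Greenwald-Khanna, Spark 2.0+) can give bucket edges in a single pass with a tunable relative error, and `approx_count_distinct` (HyperLogLog++) covers cardinality. Again, the function and column names here are assumptions for illustration only.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Hypothetical helper: approximate profile of one numeric column.
def approxNumericProfile(df: DataFrame, c: String, numBuckets: Int = 10): Unit = {
  // Quantiles at evenly spaced probabilities approximate histogram bucket edges;
  // relativeError (0.01 here) trades accuracy for memory and speed.
  val probs = (0 to numBuckets).map(_.toDouble / numBuckets).toArray
  val edges = df.stat.approxQuantile(c, probs, 0.01)

  // Approximate distinct count via HyperLogLog++.
  val distinct = df.agg(approx_count_distinct(col(c))).head().getLong(0)

  println(s"approx quantile edges: ${edges.mkString(", ")}")
  println(s"approx distinct count: $distinct")
}
```

The appeal of this option is that everything fits in a single pass with bounded memory per column, at the cost of approximate bucket boundaries and counts.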