
EDA: Report for Comparing DataFrames (create_diff_report)

Open jinglinpeng opened this issue 4 years ago • 2 comments

Is your feature request related to a problem? Please describe.

Create a report that compares dataframes, similar to sweetviz and to our create_report function.

Describe the solution you'd like

The API is similar to create_report and is as follows:

create_diff_report(
    dfs: Union[List[DataFrame], Dict[str, DataFrame]],
    config: Optional[Dict[str, Any]] = None,
    display: Optional[List[str]] = None,
    title: Optional[str] = "DataFrame Difference Report by DataPrep",
    mode: Optional[str] = "basic",
    progress: bool = True,
)

The dfs argument is either a list of dataframes or a dict of dataframes. E.g., the user can call create_diff_report([df1, df2]) or create_diff_report({'train': df1, 'test': df2}). In the former case the dataframes are named 'df1', 'df2', and so on. In the latter case each key is the name of its dataframe.
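The naming rule above could be sketched roughly as follows. This is only an illustration, not code from DataPrep: the helper name `name_dataframes` is hypothetical, and a type variable stands in for `DataFrame` so the snippet runs without pandas.

```python
from typing import Dict, List, TypeVar, Union

# Stand-in for pandas.DataFrame so the sketch has no third-party dependencies.
T = TypeVar("T")


def name_dataframes(dfs: Union[List[T], Dict[str, T]]) -> Dict[str, T]:
    """Return a name -> dataframe mapping for either accepted input form.

    Hypothetical helper: the name and behavior are assumptions for illustration.
    """
    if isinstance(dfs, dict):
        # Dict input: the keys are already the dataframe names.
        return dict(dfs)
    if isinstance(dfs, list):
        # List input: name by position, starting at 'df1'.
        return {f"df{i}": df for i, df in enumerate(dfs, start=1)}
    raise TypeError("dfs must be a list or a dict of dataframes")
```

With this in place, the rest of the report code would only need to handle the dict form, whichever way the user called the function.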

The layout of this function is similar to create_report. It has the following sections:

1. Overview. This section mirrors the Overview section of create_report. Its content comes from plot_diff([df1, df2]), as shown in the following figure. (image)

2. Variables. The layout is similar to the Variables section in create_report. (image) The differences are:

  1. For the content, we need to change the single-dataframe statistics to multi-dataframe statistics. The layout is like what we did in plot_diff([df1, df2], x). (image)
  2. For the figure, we need to show a distribution comparison instead, e.g., a histogram comparison for a numerical column and a bar chart comparison for a categorical column. The following figures show the histogram comparison and the bar chart comparison. (image)
  3. Behind the "show details" button, we change each tab to its multi-dataframe version.

3. ...To be continued
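To make the distribution comparison in the Variables section concrete, here is a minimal sketch of one way to prepare the data: pick the chart kind from the column values, and bin a numerical column from every dataframe on one shared set of edges so the histograms are directly comparable. The function names `comparison_kind` and `shared_histograms` are hypothetical, columns are plain Python lists rather than dataframe columns, and this is not how plot_diff is known to be implemented.

```python
from typing import Dict, List, Sequence, Tuple


def comparison_kind(values: Sequence[object]) -> str:
    """Pick 'hist' for numerical columns and 'bar' for everything else."""
    numeric = len(values) > 0 and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "hist" if numeric else "bar"


def shared_histograms(
    columns: Dict[str, List[float]], bins: int = 10
) -> Tuple[List[float], Dict[str, List[int]]]:
    """Bin the same column from several dataframes on common bin edges."""
    all_values = [v for vals in columns.values() for v in vals]
    if not all_values:
        raise ValueError("at least one value is required")
    lo, hi = min(all_values), max(all_values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    edges = [lo + i * width for i in range(bins + 1)]
    counts: Dict[str, List[int]] = {}
    for name, vals in columns.items():
        hist = [0] * bins
        for v in vals:
            # Clamp so the global maximum lands in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            hist[idx] += 1
        counts[name] = hist
    return edges, counts
```

For example, `shared_histograms({"train": [0, 1, 2, 3], "test": [2, 3, 4]}, bins=2)` places both columns on the edges 0, 2, 4, so the per-dataframe counts can be drawn side by side on one axis.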

jinglinpeng, Oct 08 '21 22:10

@jinglinpeng

Is this supposed to be a new function or am I supposed to modify the existing create_report function? Also, if this is a new function where should I define it? I would assume in the same file as the existing create_report.

devinllu, Oct 12 '21 18:10

@devinllu This is a new function. Its relationship to plot_diff is similar to that of create_report to plot. You can define it under the eda module: https://github.com/sfu-db/dataprep/tree/develop/dataprep/eda.

jinglinpeng, Oct 12 '21 18:10