
Using plot_diff to compare DataFrames

Open eutialia opened this issue 5 years ago • 2 comments

Summary

The goal of this feature is to provide a high-level API for showing the differences between similar DataFrames. It also works on subsets of a single DataFrame.

Design-level Explanation Actions

  • [x] Design the overall architecture of plot_diff.
  • [x] Use the existing components as the reference.
  • [x] Render the prototype.
  • [ ] Polish the details of each API.

Design-level Explanation

plot_diff([df1, df2, ..., dfn])
# This shows a summary of all the DataFrames, similar to what plot(df) already provides.

UI mock up: image


plot_diff([df1, df2, ..., dfn], x)
# This shows the differences between the DataFrames on a specific column, provided the column exists in at least one of the dfs.
# The output includes the distributions and correlations.

plot_diff(df, x, ["cat_1", "cat_2"])
# This compares two subsets of a single DataFrame,
# obtained by selecting two categories of a categorical variable; it is similar to what plot_diff([df1, df2], x) does.

plot_diff(df, x, ["[0:100]", "[100:200]"])
# This compares two subsets of a single DataFrame,
# obtained by slicing a numerical variable with the given windows; it is similar to what plot_diff([df1, df2], x) does.

plot_diff(df, x, ["[2020-01-01:2020-07-01]", "[2020-07-01:2021-01-01]"])
# This compares two subsets of a single DataFrame,
# obtained by slicing time-related data with the given windows; it is similar to what plot_diff([df1, df2], x) does.
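The three subset forms above could be resolved by one small helper. The following is only a sketch of that idea, not the dataprep implementation: the names parse_window and select_subset are hypothetical, the windows are assumed to be half-open ([lo, hi)) so that "[0:100]" and "[100:200]" do not overlap, and any selector that does not start with "[" is assumed to be a categorical value.

```python
from typing import Tuple

import pandas as pd


def parse_window(window: str) -> Tuple[str, str]:
    """Split a window string such as "[0:100]" into its two bounds (as text)."""
    text = window.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"not a window: {window!r}")
    lo, sep, hi = text[1:-1].partition(":")
    if not sep:
        raise ValueError(f"window needs a ':' separator: {window!r}")
    return lo.strip(), hi.strip()


def select_subset(df: pd.DataFrame, x: str, selector: str) -> pd.DataFrame:
    """Return the rows of df selected by a category value or a window on column x."""
    col = df[x]
    if not selector.strip().startswith("["):
        # Assumption: a plain string names one category of a categorical column.
        return df[col == selector]
    lo, hi = parse_window(selector)
    if pd.api.types.is_datetime64_any_dtype(col):
        lower, upper = pd.Timestamp(lo), pd.Timestamp(hi)
    else:
        lower, upper = float(lo), float(hi)
    # Assumption: half-open window, lower bound included and upper bound excluded.
    return df[(col >= lower) & (col < upper)]


df = pd.DataFrame(
    {
        "age": [5, 50, 100, 150, 250],
        "kind": ["a", "b", "a", "b", "a"],
        "day": pd.to_datetime(
            ["2020-01-15", "2020-03-01", "2020-07-01", "2020-09-01", "2021-02-01"]
        ),
    }
)
first_half = select_subset(df, "day", "[2020-01-01:2020-07-01]")
second_half = select_subset(df, "day", "[2020-07-01:2021-01-01]")
```

Two limitations of this sketch: a category value that itself starts with "[" would be misread as a window, and splitting on the first ":" would break for timestamps that include a time of day.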
image

If the column names of two dfs are identical, partially overlap, or one set contains the other, the dfs are comparable.
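The comparability rule can be sketched with plain set operations on the column names. This is an illustration only: shared_columns and are_comparable are hypothetical names, not part of dataprep. For DataFrames that have at least one column, the three cases (identical, overlapping, inclusion) all reduce to "the intersection of the column sets is non-empty".

```python
from typing import List, Set

import pandas as pd


def shared_columns(dfs: List[pd.DataFrame]) -> Set[str]:
    """Return the column names present in every DataFrame in dfs."""
    if not dfs:
        return set()
    common = set(dfs[0].columns)
    for df in dfs[1:]:
        common &= set(df.columns)
    return common


def are_comparable(dfs: List[pd.DataFrame]) -> bool:
    """Treat DataFrames as comparable when they share at least one column name."""
    return len(dfs) >= 2 and bool(shared_columns(dfs))


df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})
df3 = pd.DataFrame({"d": [9, 10]})

overlap = are_comparable([df1, df2])   # the dfs share column "b"
disjoint = are_comparable([df1, df3])  # the dfs share no columns
```

Matching on names alone ignores column dtypes, which is one reason a more careful check may be needed.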

Implementation-level Explanation

WIP

Rationale and Alternatives

WIP

Prior Art

We reuse existing components from plot(df) and plot(df, x). We also use sweetviz as the reference for prototyping.

Future Possibilities

A better way to determine whether the given DataFrames are comparable.

Implementation-level Actions

Additional Tasks

  • [x] This task is put into the correct pipeline (Development Backlog or In Progress).
  • [x] The label of this task is set correctly.
  • [x] The issue is assigned to the correct person.
  • [x] The issue is linked to the related Epic.
  • [ ] The documentation is changed accordingly.
  • [ ] Tests are added accordingly.

eutialia, Jan 20 '21 00:01

Looks nice!

A reference for dataframe comparison: https://towardsdatascience.com/powerful-eda-exploratory-data-analysis-in-just-two-lines-of-code-using-sweetviz-6c943d32f34

The slicing functionality may be hard for users to understand.

Besides, not sure whether we also need to consider row-wise dataframe comparison.

jinglinpeng, Jan 20 '21 02:01

Ya nice job Ling! I think all of the tasks you listed are important. Also, for your consideration, here's a thread from Jiannan's reddit post that gives some ideas for plot_diff.

brandonlockhart, Jan 20 '21 07:01