valentine icon indicating copy to clipboard operation
valentine copied to clipboard

New feature: Get top n columns

Open michaelkonstantinou opened this issue 2 years ago • 2 comments

Resolves #52

As stated in issue #52 , it would be useful to be able to get the top n similar columns when analyzing the data. Since the issue is still open, I decided to add this feature myself as I could use it during my data preprocessing

Solution

This pull request adds two new methods into the metrics.py file

  • get_top_n_columns which returns a complete dictionary of all top-n columns for each column in the two datasets
  • For instance: {('Table1', 'Authors'): ['Authors, 'AccessList']}
  • get_top_n_columns_for_column Something that could be more useful in my opinion, to return the top columns for a specific dataset
  • For instance: What are the top two matches for column 'Access' of table 1?

I am not quite sure what exactly the OP wanted or what the team would prefer to, but at least a boilerplate is established and in case more information should be added that can be easily modified. (e.g. add float value next to it)

Additional changes

Added a new example to demonstrate the new feature. It uses a different algorithm though as COMA compares the names as well and in this case it might not be much informative

Notes

  • I didn't find a test case for metrics that's why I didn't add one
  • The code style being used complies with PEP-8

I hope this is useful. Let me know if you prefer any changes or any additional functionality.

michaelkonstantinou avatar Jul 11 '23 01:07 michaelkonstantinou

Hello @Mikhail-Konstantinou , first of all thank you for your contribution, it's great to see contributions from the outside being made.

Overall the code looks good!

I have a couple of comments for you to take into consideration:

  • I feel like the two methods have significant overlap in functionality. It would make sense if get_top_n_columns would use get_top_n_columns_for_column somehow. Another option is to provide get_top_n_columns with a keyword argument that allows to you specify a list of specific columns of df1 to use for top n in df2 (and by default have it pick all columns), so you could get rid of the second method that has an overly long name :)

  • Maybe it's nice to have a list of dicts, with column name as key and score as value, instead of just a list of column names. Doing this gives insight into the distribution of the scores. I think this is also what you suggested with the "add float value" remark.

EDIT: After a second look I dropped some of my comments, so I've adjusted the post.

Archer6621 avatar Sep 18 '23 19:09 Archer6621

@Archer6621

Hello and thanks for your input. I believe the final changes solve both of the issues/suggestions you mentioned

  1. Indeed, get_top_n_columns_for_column is long and not needed anymore. I refactored get_top_n_columns to accept a list of keys. If not, it returns all columns by default as you suggested. However, I changed it a bit and instead of choosing which columns from df1 you want... you can choose which columns you want either from df1 or from df2. I believe the latter is stronger, more flexible and cleaner
  2. Yes, now it returns a list of dicts. The column name is the key and the score is the value

PS. I checked the conflicting files that github complains about, and they are not related to this function. I believe you can merge it easily by selecting the line of code you think is correct

michaelkonstantinou avatar Oct 22 '23 19:10 michaelkonstantinou