New feature: Get top n columns
Resolves #52
As stated in issue #52 , it would be useful to be able to get the top n similar columns when analyzing the data. Since the issue is still open, I decided to add this feature myself as I could use it during my data preprocessing
Solution
This pull request adds two new methods into the metrics.py file
- get_top_n_columns which returns a complete dictionary of all top-n columns for each column in the two datasets
- For instance: {('Table1', 'Authors'): ['Authors, 'AccessList']}
- get_top_n_columns_for_column Something that could be more useful in my opinion, to return the top columns for a specific dataset
- For instance: What are the top two matches for column 'Access' of table 1?
I am not quite sure what exactly the OP wanted or what the team would prefer to, but at least a boilerplate is established and in case more information should be added that can be easily modified. (e.g. add float value next to it)
Additional changes
Added a new example to demonstrate the new feature. It uses a different algorithm though as COMA compares the names as well and in this case it might not be much informative
Notes
- I didn't find a test case for metrics that's why I didn't add one
- The code style being used complies with PEP-8
I hope this is useful. Let me know if you prefer any changes or any additional functionality.
Hello @Mikhail-Konstantinou , first of all thank you for your contribution, it's great to see contributions from the outside being made.
Overall the code looks good!
I have a couple of comments for you to take into consideration:
-
I feel like the two methods have significant overlap in functionality. It would make sense if
get_top_n_columnswould useget_top_n_columns_for_columnsomehow. Another option is to provideget_top_n_columnswith a keyword argument that allows to you specify a list of specific columns of df1 to use for top n in df2 (and by default have it pick all columns), so you could get rid of the second method that has an overly long name :) -
Maybe it's nice to have a list of dicts, with column name as key and score as value, instead of just a list of column names. Doing this gives insight into the distribution of the scores. I think this is also what you suggested with the "add float value" remark.
EDIT: After a second look I dropped some of my comments, so I've adjusted the post.
@Archer6621
Hello and thanks for your input. I believe the final changes solve both of the issues/suggestions you mentioned
- Indeed, get_top_n_columns_for_column is long and not needed anymore. I refactored get_top_n_columns to accept a list of keys. If not, it returns all columns by default as you suggested. However, I changed it a bit and instead of choosing which columns from df1 you want... you can choose which columns you want either from df1 or from df2. I believe the latter is stronger, more flexible and cleaner
- Yes, now it returns a list of dicts. The column name is the key and the score is the value
PS. I checked the conflicting files that github complains about, and they are not related to this function. I believe you can merge it easily by selecting the line of code you think is correct