AIF360 icon indicating copy to clipboard operation
AIF360 copied to clipboard

Define "data stores" (to make it easy to download data from various sources)

Open anupamamurthi opened this issue 3 years ago • 6 comments

Signed-off-by: [email protected] Resolves: #343

anupamamurthi avatar Sep 16 '22 19:09 anupamamurthi

A couple thoughts:

  • Can these just be functions? Is there a use case that requires it to be a class? This would make it more in line with fetch_openml
  • Pandas can download from a URL natively (e.g. read_csv). The drawback is it won't cache the file. It also doesn't work with zipped files...
  • A lot of the utils seem unused. Are they necessary?

hoffmansc avatar Sep 16 '22 19:09 hoffmansc

Hi @hoffmansc this is my take and thinking, but happy to do whatever is best for the project. Let me know ur thoughts.

  1. Classes maybe handy to do things like -validate to see if the incoming UCI URL or Kaggle URL is valid and we can extend these classes in the future (if needed) by adding additional error checks that are specific to the "data store". For COS, I can think of how we may want to "list cos objects" or get a particular version of the cos object instead of getting the latest etc. Maybe too futuristic. but one could argue saying we can live with functions (similar to fetch_openml). I am fine with that as well.

regarding 3rd point, yeah I thought these utils may come handy but I can remove them and can be added as and when needed!

anupamamurthi avatar Sep 16 '22 19:09 anupamamurthi

This pull request introduces 4 alerts when merging 10d2e9a51c084f90694f116e706a8656d219a766 into cabb8920dedaa7ab65b5682a1a9e28979327519e - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 1 for Module is imported more than once

lgtm-com[bot] avatar Sep 16 '22 20:09 lgtm-com[bot]

I guess functions seems simpler to me. I'm worried this is a little overcomplicated. For example, it seems like the kaggle function essentially does what we need (assuming it checks if the dataset is valid or already exists). Versioning and error checks can still happen inside the function. Listing available datasets could be a separate function. The only thing is if authentication gets cumbersome to input but, at least for kaggle, it's stored in a file so this isn't an issue.

hoffmansc avatar Sep 16 '22 21:09 hoffmansc

I hear you and Karthi shared the same concern :) I will fix it up. Will have some util functions instead!

anupamamurthi avatar Sep 16 '22 21:09 anupamamurthi

Another thing I thought of is we only check if the file names are present in the cache dir but it's possible two URLs have the same file names but different contents. In other words, the cache "key" should be the URL+file. Is this an issue or should we ignore this unlikely problem? Maybe we could hash the URL somehow and save the files in a folder under that unique name? It sacrifices readability though

hoffmansc avatar Sep 16 '22 21:09 hoffmansc