Design and implement a mechanism to download datasets from Kaggle
Category
Datasets
Why is this issue important
Currently, AIF360 has no easy way to download datasets from sources such as Kaggle. Users must download the data manually before running experiments.
How to go about this issue
To understand how datasets are used, we would recommend poking around https://github.com/Trusted-AI/AIF360/tree/master/aif360/data/raw and modules in https://github.com/Trusted-AI/AIF360/tree/master/aif360/datasets
The datasets are used in notebooks as seen here: https://github.com/Trusted-AI/AIF360/blob/master/examples/demo_meta_classifier.ipynb (This is one example but feel free to poke around various examples)
Once you have a sense of what is going on with the data and how the datasets are used, it should be easy to proceed with the implementation proposed below.
Implementation / proposal
- [ ] Define a base class (something similar to the below)
import abc
import logging
import sys

# abc.ABC exists from Python 3.4 onwards; build an equivalent base class otherwise.
if sys.version_info >= (3, 4):
    ABC = abc.ABC
else:
    ABC = abc.ABCMeta(str('ABC'), (), {})

# Module-level logger from the standard library.
logger = logging.getLogger(__name__)


class Store(ABC):
    """Abstract interface for a dataset store (for example, Kaggle)."""

    @abc.abstractmethod
    def __init__(self, **kwargs):
        pass

    @abc.abstractmethod
    def validate_store(self, **kwargs):
        pass

    @abc.abstractmethod
    def download(self, **kwargs):
        pass

    @abc.abstractmethod
    def upload(self, **kwargs):
        pass
- [ ] Implement KaggleStore, which overrides the methods of the base class
Overall, the idea is to have the ability to download data from Kaggle using this helper.
There is no need to stick to the above definition.
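To make the proposal more concrete, here is a minimal sketch of what a KaggleStore could look like. It is only one possible shape, not a settled design. `KaggleApi`, `authenticate()` and `dataset_download_files()` come from the official `kaggle` Python package; everything else (the injected `api` argument, the `dataset` and `dest` parameter names, the read-only `upload`) is an assumption made for illustration.

```python
import abc
import os


class Store(abc.ABC):
    """Trimmed-down version of the base class proposed above."""

    @abc.abstractmethod
    def validate_store(self, **kwargs):
        pass

    @abc.abstractmethod
    def download(self, **kwargs):
        pass

    @abc.abstractmethod
    def upload(self, **kwargs):
        pass


class KaggleStore(Store):
    """Hypothetical store that fetches Kaggle datasets through a client object."""

    def __init__(self, api=None):
        # `api` (assumed parameter) is any object exposing
        # `dataset_download_files`; injecting it keeps the class testable
        # without network access or credentials.
        self._api = api

    def _client(self):
        if self._api is None:
            # Imported lazily because importing `kaggle` may try to
            # authenticate right away and fail without credentials.
            from kaggle.api.kaggle_api_extended import KaggleApi
            self._api = KaggleApi()
            self._api.authenticate()
        return self._api

    def validate_store(self, **kwargs):
        # True once a usable client exists; raises if none can be created.
        return self._client() is not None

    def download(self, dataset, dest, **kwargs):
        # `dataset` is an "owner/dataset-name" slug, `dest` a local directory.
        os.makedirs(dest, exist_ok=True)
        self._client().dataset_download_files(dataset, path=dest, unzip=True)
        return dest

    def upload(self, **kwargs):
        # Assumption: AIF360 only needs to read datasets, so this is a stub.
        raise NotImplementedError("KaggleStore is read-only in this sketch")
```

Passing the client in through the constructor is a deliberate choice here: tests can hand in a fake object, while normal use falls back to the real Kaggle client.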
How to test?
Adding a good unit test will certainly help test the above logic/code.
If all goes well, it would be nice to go over the available datasets and see whether we can download them directly using KaggleStore instead of hardcoding the location of the data, as seen here: https://github.com/Trusted-AI/AIF360/tree/1de824717be15e2a0ebabe9bd8a718787196af73/aif360/datasets
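As a rough idea of what such a unit test might look like, the sketch below replaces the Kaggle client with a `unittest.mock.Mock`, so no network access or API key is needed. The `KaggleStore` defined here is a tiny stand-in so the example runs on its own, and `run_tests` is a hypothetical helper; neither is part of AIF360.

```python
import unittest
from unittest import mock


class KaggleStore:
    """Tiny stand-in for the real class; only what the test needs."""

    def __init__(self, api):
        self._api = api

    def download(self, dataset, dest):
        # Delegate to the client and report where the files were placed.
        self._api.dataset_download_files(dataset, path=dest, unzip=True)
        return dest


class KaggleStoreTest(unittest.TestCase):
    def test_download_delegates_to_client(self):
        api = mock.Mock()  # stands in for the Kaggle client
        store = KaggleStore(api)

        result = store.download("owner/dataset-name", "/tmp/data")

        self.assertEqual(result, "/tmp/data")
        api.dataset_download_files.assert_called_once_with(
            "owner/dataset-name", path="/tmp/data", unzip=True
        )


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(KaggleStoreTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The same test body should also work under pytest, since pytest collects `unittest.TestCase` classes.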
Can you explain these methods a little more? What does upload do?
We can also look into: https://www.kaggle.com/docs/api
Looks like you need an API key, though
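On the API key: the Kaggle client reads credentials either from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or from a `kaggle.json` file (by default in `~/.kaggle`, or in the directory named by `KAGGLE_CONFIG_DIR`). A check along these lines could back `validate_store`. This is only a sketch; `find_kaggle_credentials` and its parameters are hypothetical names, not part of the Kaggle package.

```python
import json
import os


def find_kaggle_credentials(environ=None, config_dir=None):
    """Return "env", "file", or None depending on where credentials are found."""
    environ = os.environ if environ is None else environ

    # Option 1: credentials supplied through environment variables.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return "env"

    # Option 2: a kaggle.json token file in the configuration directory.
    if config_dir is None:
        config_dir = environ.get("KAGGLE_CONFIG_DIR") or os.path.join(
            os.path.expanduser("~"), ".kaggle"
        )
    token_path = os.path.join(config_dir, "kaggle.json")
    try:
        with open(token_path) as handle:
            token = json.load(handle)
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: treat as "no credentials".
        return None

    if isinstance(token, dict) and token.get("username") and token.get("key"):
        return "file"
    return None
```

Returning `None` instead of raising lets the caller decide whether to fail with a helpful message or fall back to the manually downloaded files.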