easydata
A flexible template for doing reproducible data science in Python.
Multiple local forks of the same repo lead to multiple `src` modules, with only one of them installed with the "correct" paths. How do we avoid this, or at least make it robust?
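One way to see which fork "wins" is to check where Python actually resolves `src` from. This is only a diagnostic sketch, not an easydata feature:

```python
# Diagnostic only: report which of several checked-out `src` packages
# is the one Python actually imports in this environment.
import src

print(src.__file__)  # path to the src package that is currently installed/active
```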
I should be able to download a LICENSE or README from a URL and add them to a datasource
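A rough sketch of what that could look like. `add_metadata_from_url` is a hypothetical helper, not an existing easydata API, and it assumes the datasource exposes an `add_metadata(contents=..., kind=...)`-style call:

```python
import requests

def add_metadata_from_url(dsrc, kind, url):
    """Hypothetical helper: download LICENSE/README text and attach it to a datasource.

    `dsrc` is assumed to expose an `add_metadata(contents=..., kind=...)`-style method;
    the exact easydata API may differ.
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    dsrc.add_metadata(contents=response.text, kind=kind)

# Usage (names are illustrative):
# add_metadata_from_url(dsrc, kind="LICENSE", url="https://example.com/LICENSE")
```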
With `hash_value=None` it should compute the hash and store it. It's expecting a value anyway...
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
 in
----> 1 dsrc.fetch()

.../src/data/datasets.py in fetch(self,...
```
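A sketch of the fallback behaviour being asked for: when no hash is supplied, compute one from the fetched file and record it. How easydata stores the result (and which `hash_type` it defaults to) is an assumption here:

```python
import hashlib

def hash_file(path, hash_type="sha1", blocksize=2 ** 20):
    """Illustrative helper: compute a hex digest of a file in chunks."""
    hasher = hashlib.new(hash_type)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Sketch of the desired fetch() fallback (not the actual easydata code):
# if hash_value is None:
#     hash_value = hash_file(downloaded_path, hash_type)
#     # ...store hash_value back in the datasource's file list...
```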
`workflow.add_dataset` should wrap the appropriate dag calls. For example: `workflow.add_dataset(dataset_name='wine_reviews_130k', datasource_name='wine_reviews')` should simply wrap `dag.add_source(output_dataset='wine_reviews_130k', datasource_name='wine_reviews')`
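A minimal sketch of such a wrapper, assuming `dag.add_source` takes the keyword arguments shown above; the import path and the extra `**kwargs` pass-through are assumptions:

```python
def add_dataset(dataset_name, datasource_name, **kwargs):
    """Thin convenience wrapper around the underlying DAG call (illustrative)."""
    from src.data import dag  # assumed location of the dag module
    return dag.add_source(
        output_dataset=dataset_name,
        datasource_name=datasource_name,
        **kwargs,
    )

# Intended usage:
# workflow.add_dataset(dataset_name='wine_reviews_130k', datasource_name='wine_reviews')
```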
In the docstring it says:
> If a cached copy of the dataset is present on disk, (and its hashes match those in the dataset catalog),
> the cached copy...
When running `make create_environment` it seems to be using the lock file:
```
/bin/conda env update -n covid_nlp -f environment.i386.lock.yml
Collecting package metadata (repodata.json): done
Solving environment: done
```
is...
In Makefile.include there is a hard-coded CONDA_EXE path. Is there a way to at least issue a warning if you try to make it using someone else's path? (like when...
`make data` and `make sources` both end in an error if there is no process function:
```
python3 -m src.data.make_dataset process
2020-03-21 12:35:54,219 - datasets - INFO - Running process...
```
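One possible fix, sketched here under the assumption that process functions are looked up by name on a module: fall back to a no-op and log a warning instead of raising.

```python
import logging

logger = logging.getLogger("datasets")

def get_process_function(module, name="process"):
    """Illustrative lookup: return the dataset's process function, or a
    warning no-op when the module doesn't define one."""
    func = getattr(module, name, None)
    if func is None:
        logger.warning("No %r function found in %s; skipping processing step.",
                       name, module.__name__)
        return lambda *args, **kwargs: None
    return func
```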
`make clean` currently runs `rm -rf` commands. Bad things can happen if your paths aren't set right or if you share your data directory. Clean based on file names instead.
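A sketch of name-based cleanup, assuming the catalog can tell us which files each dataset owns (the `catalog` mapping used here is illustrative, not easydata's actual layout):

```python
from pathlib import Path

def clean_dataset_files(data_dir, catalog):
    """Remove only files explicitly listed in the catalog, rather than
    `rm -rf`-ing whole directories. `catalog` maps dataset names to lists
    of file names (an assumption about how the catalog is organised)."""
    data_dir = Path(data_dir)
    for dataset_name, filenames in catalog.items():
        for filename in filenames:
            target = data_dir / filename
            if target.is_file():
                target.unlink()
```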
Make data on demand. That is, if the instructions are there in the catalog, `Dataset.load("datasetname")` should just work, even if no fetching, unpacking, or processing has happened yet.
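A sketch of the intended behaviour; the cache check and the way build steps are stored on the catalog entry are assumptions, not easydata's internals:

```python
from pathlib import Path

def load(dataset_name, catalog, data_dir="data/processed"):
    """Illustrative on-demand loader: if no cached copy exists yet, run the
    catalog's build steps first. The catalog entry is assumed to carry its
    own fetch/unpack/process/read callables (an assumption for this sketch)."""
    entry = catalog[dataset_name]
    cached = Path(data_dir) / f"{dataset_name}.dataset"
    if not cached.exists():
        entry["fetch"]()     # download raw files
        entry["unpack"]()    # extract archives
        entry["process"]()   # build the processed dataset
    return entry["read"](cached)
```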