
Update the dataset workflow with new structure/format

theomeb opened this issue 4 years ago · 0 comments

The main idea (still to be confirmed) is to offer the user the following workflow:

  • The user adds raw data files (currently CSV, plus NPY for embeddings; other formats to be supported later)
  • The user defines a schema declaring the variable types
  • The library converts the raw data files into a format suitable for loading into memory
  • The dataset instance can return a native tf.data.Dataset or torch.utils.data.DataLoader so models can be trained on this dataset

For the last point, there are mainly two options:

  • convert the dataset (CSV with NPY files) to HDF5, then use Apache Arrow or Vaex to load it into memory
  • or, if we want native TF/Torch tensors in the end, convert the dataset to Parquet and then use Petastorm

Brainstorming has been done in a Notion doc. The next step is to properly investigate the different options.

Other points:

  • TensorFlow and PyTorch should not be hard dependencies of the project; we need to list them in environment.yaml rather than requirements.txt
  • The user needs to be able to use the biodatasets package with either PyTorch or TensorFlow installed, so we need to handle import errors in both to_torch_dataset() and to_tf_dataset() and display a message telling the user which library to install when one of these functions is called without it.

theomeb · Apr 23 '21 15:04