
Update the dataset workflow with new structure/format

theomeb opened this issue 4 years ago · 0 comments

The main idea (still to be confirmed) is to offer the user the following workflow:

  • The user adds raw data files (currently CSV, plus NPY for embeddings; other formats to be supported later)
  • The user defines a schema declaring the variable types
  • The library converts the raw data files into a format suitable for loading into memory
  • The dataset instance can return a native tf.data.Dataset or torch.utils.data.DataLoader so models can be trained on this dataset

For the last point, there are mainly two options:

  • convert the dataset (CSV with NPY files) to HDF5, then use Apache Arrow or Vaex to load it into memory
  • or, if we want native TF/Torch tensors in the end, convert the dataset to Parquet and then use Petastorm

Brainstorming has been done in a Notion doc. The next step is to properly investigate the different options.

Other points:

  • TensorFlow and PyTorch should not be hard dependencies of the project; we need to list them in environment.yaml rather than requirements.txt
  • The user needs to be able to use the biodatasets package with either PyTorch or TensorFlow installed, so we need to handle import errors in both to_torch_dataset() and to_tf_dataset() and display a message telling the user which library to install when one of these functions is called without it.

theomeb · Apr 23 '21 15:04