bio-datasets
Update the dataset workflow with new structure/format
The main idea (still to be confirmed) is the following user workflow:
- The user adds raw data files (currently `csv` plus `npy` for embeddings; to be extended to other formats as well)
- The user defines a schema for variable types
- The library converts the raw data files into a format suitable for loading the data into memory
- The dataset instance can return a native `tf.data.Dataset` or `torch.utils.data.DataLoader` so that models can be trained with this dataset (see the sketch after this list)
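
To make this workflow concrete, here is a rough sketch of what the user-facing API could look like. Everything below is an assumption to be confirmed: the `BioDataset` class, the schema format, and the method names are illustrative, not an existing API.

```python
# Hypothetical user-facing API -- all names below are illustrative, not final.
from biodatasets import BioDataset

dataset = BioDataset(
    data="data.csv",               # raw tabular file
    embeddings="embeddings.npy",   # per-sample embedding matrix
    schema={                       # user-defined variable types
        "sequence": "string",
        "label": "int64",
        "embedding": "float32",
    },
)

# The library converts the raw files once, then hands back native objects:
train_loader = dataset.to_torch_dataset()  # torch.utils.data.DataLoader
# train_ds = dataset.to_tf_dataset()       # tf.data.Dataset
```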
For the last point, there are mainly two options:
- convert the dataset (`csv` with `npy` files) to `hdf5`, then use Apache Arrow or vaex to load it in memory
- or, if we want native `tf`/`torch` tensors in the end: convert the dataset into Parquet and then use petastorm (see the sketch after this list)
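
As an illustration of the Parquet route, a minimal sketch assuming pandas, pyarrow, and petastorm are installed; the file paths and the choice of storing each embedding row as a list column are assumptions, not a settled design:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Merge the raw csv columns and the npy embeddings into one Parquet dataset.
df = pd.read_csv("data.csv")            # raw tabular file (assumed path)
embeddings = np.load("embeddings.npy")  # assumed shape: (n_samples, dim)
df["embedding"] = list(embeddings)      # one list-valued column per sample

pq.write_to_dataset(pa.Table.from_pandas(df), root_path="/tmp/dataset_parquet")

# Stream the Parquet dataset back as native torch tensors with petastorm.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

with make_batch_reader("file:///tmp/dataset_parquet") as reader:
    with DataLoader(reader, batch_size=32) as loader:
        for batch in loader:  # roughly: dict of column name -> torch.Tensor
            pass
```

For the TensorFlow side, petastorm provides `make_petastorm_dataset` in `petastorm.tf_utils` to wrap the same reader into a `tf.data.Dataset`.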
Brainstorming has been done in a Notion doc. The next step is to properly investigate the different options.
Other points:
- TensorFlow and PyTorch should not be hard dependencies of the project; we need to list them as dependencies in `environment.yaml` rather than `requirements.txt`.
- The user needs to be able to use the `biodatasets` package with either PyTorch or TF installed, so we need to handle `ImportError` in both `to_torch_dataset()` and `to_tf_dataset()` and catch it to tell the user that the corresponding library must be installed when they call one of these functions (sketched below).
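
A minimal sketch of that deferred-import pattern; the class name and error messages are illustrative:

```python
class BioDataset:
    """Hypothetical dataset wrapper -- the class name is illustrative."""

    def to_torch_dataset(self):
        try:
            # Import lazily so PyTorch stays an optional dependency.
            import torch  # noqa: F401
        except ImportError as err:
            raise ImportError(
                "to_torch_dataset() requires PyTorch; "
                "please install it (e.g. `pip install torch`)."
            ) from err
        # ... build and return the torch dataset/dataloader here

    def to_tf_dataset(self):
        try:
            # Import lazily so TensorFlow stays an optional dependency.
            import tensorflow  # noqa: F401
        except ImportError as err:
            raise ImportError(
                "to_tf_dataset() requires TensorFlow; "
                "please install it (e.g. `pip install tensorflow`)."
            ) from err
        # ... build and return the tf.data.Dataset here
```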