
Concept of managed datasets for create/update/delete

Open mortenpi opened this issue 3 years ago • 0 comments

One issue that I see with implementing create/update/delete operations (#31, #38) here in DataSets is that different data repositories may have very different ideas of how to execute them, and may require repository-specific information.

A case in point: the TOML-based data repos generally just link to data. Should deletion delete the linked file, or just the metadata? If you create a new dataset, where will the file be? Do you need to pass some options?

One design goal of DataSets is that it provides a universal, relocatable interface. So if you create datasets in a script, that should work consistently even if you move to a different repository. But if you have to pass repository-specific options, that principle breaks down.
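For comparison, the read path already works this way: a script refers to a dataset purely by name, and the repository backing it can be swapped without touching the code. A minimal sketch, assuming the usual dataset/open entry points:

    using DataSets

    # The script only knows the dataset by name; whether it is backed by a
    # linked local file, a managed store, or a remote repository is the
    # repository's concern, not the script's.
    ds = dataset("some-dataset")

    # Read the blob contents (for a File-type dataset).
    contents = open(String, ds)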

To provide create/update/delete functionality in a generic way, we could have the notion of managed datasets. Basically, the data repository fully owns and controls the storage. When you create a dataset, you essentially just hand it over to the repository, and as the user you can no longer exercise any control over it in your script.

For remote, managed storage of datasets, this is how it must work by definition. But we should also have this for the local Data.toml-based repositories. I imagine that your repository would manage a directory somewhere where the data actually gets stored, e.g.:

my-data-project/Project.toml
               /Data.toml
               /.datasets/<uuid1-for-File>
               /.datasets/<uuid2-for-FileTree>/foo.csv
               /.datasets/<uuid2-for-FileTree>/bar.csv
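The Data.toml would then describe such datasets with storage paths pointing into the managed directory, along the lines of the existing FileSystem driver entries. A hypothetical sketch (the exact keys, and whether an explicit "managed" marker is needed, are open questions):

    [[datasets]]
    name = "new-ds-name"
    uuid = "<uuid1-for-File>"

        [datasets.storage]
        driver = "FileSystem"
        type = "File"
        # The path is owned by the repository, keyed by the dataset UUID.
        path = ".datasets/<uuid1-for-File>"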

If you now create a dataset in a local project from a file with something like

DataSets.create("new-ds-name", "local/file.csv")

it will generate a UUID for the dataset and simply copy the file to .datasets/<uuid>. This way we also avoid problems with, e.g., trying to infer destination file names and running into conflicts.
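Under the hood, such a create operation for a local managed repository could boil down to generating the UUID and copying the file into place. A rough sketch of the idea; the helper name is hypothetical and not part of the current API:

    using UUIDs

    # Hypothetical helper: copy a local file into the project's managed
    # storage and return the UUID plus the path that the new Data.toml
    # entry would record.
    function copy_into_managed_storage(project_dir, local_file)
        uuid = uuid4()
        dest = joinpath(project_dir, ".datasets", string(uuid))
        mkpath(dirname(dest))
        cp(local_file, dest)
        return uuid, dest
    end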

A few closing thoughts:

  • A data repo might not support managed datasets at all. That's fine; you just can't create/update/delete datasets there, only read existing ones. A repo may also have some datasets that are unmanaged even if it otherwise supports managed ones.
  • All "linked" datasets in a TOML file would be unmanaged, and hence read-only. It might even be worth implementing them via a separate storage driver, so as not to conflate them with the implementation for standard (managed) datasets. I'm not sure about an API for creating such a linked dataset -- it would probably have to be repository-specific, because such datasets only make sense for some repositories.
  • You might be able to convert linked datasets into managed ones, though, which would copy the data to the repository's storage (whatever that may be); a rough sketch of that operation follows below.
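Conversion would then be a one-way operation on the repository side: copy the linked data into the managed storage and rewrite the dataset's storage section. A hedged sketch of the idea (none of these names exist in DataSets today, and the dataset config is treated as a plain dictionary for simplicity):

    # Hypothetical: turn a linked (read-only) dataset into a managed one by
    # copying its current data under .datasets/<uuid> and repointing the path.
    function convert_to_managed!(project_dir, ds_config)
        uuid = ds_config["uuid"]
        old_path = ds_config["storage"]["path"]
        dest = joinpath(project_dir, ".datasets", uuid)
        mkpath(dirname(dest))
        cp(old_path, dest)   # works for both files and directories
        ds_config["storage"]["path"] = relpath(dest, project_dir)
        return ds_config
    end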

mortenpi · Nov 10 '22 01:11