
Concept of managed datasets for create/update/delete

Open mortenpi opened this issue 3 years ago • 0 comments

One issue that I see with implementing create/update/delete operations (#31, #38) here in DataSets is that different data repositories may have very different ideas of how to execute them, and may require repository-specific information.

A case in point: the TOML-based data repos generally just link to data. Should deletion delete the linked file, or just the metadata? If you create a new dataset, where will the file be? Do you need to pass some options?

One design goal of DataSets is that it provides a universal, relocatable interface. So if you create datasets in a script, that should work consistently even if you move to a different repository. But if you have to pass repository-specific options, that principle breaks down.
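For comparison, the read path already works this way: a script refers to a dataset purely by name, and the repository backing it can be swapped without touching the code. A minimal sketch, assuming the usual dataset/open entry points:

    using DataSets

    # The script only knows the dataset by name; whether it is backed by a
    # linked local file, a managed store, or a remote repository is the
    # repository's concern, not the script's.
    ds = dataset("some-dataset")

    # Read the blob contents (for a File-type dataset).
    contents = open(String, ds)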

To provide create/update/delete functionality in a generic way, we could have the notion of managed datasets. Basically, the data repository fully owns and controls the storage. When you create a dataset, you essentially just hand it over to the repository, and as the user you can no longer exercise any control over it in your script.

For remote, managed storage of datasets, this is how it must work by definition. But we should also have this for the local Data.toml-based repositories. I imagine that your repository would manage a directory somewhere where the data actually gets stored, e.g.:

my-data-project/Project.toml
               /Data.toml
               /.datasets/<uuid1-for-File>
               /.datasets/<uuid2-for-FileTree>/foo.csv
               /.datasets/<uuid2-for-FileTree>/bar.csv
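The Data.toml would then describe such datasets with storage paths pointing into the managed directory, along the lines of the existing FileSystem driver entries. A hypothetical sketch (the exact keys, and whether an explicit "managed" marker is needed, are open questions):

    [[datasets]]
    name = "new-ds-name"
    uuid = "<uuid1-for-File>"

        [datasets.storage]
        driver = "FileSystem"
        type = "File"
        # The path is owned by the repository, keyed by the dataset UUID.
        path = ".datasets/<uuid1-for-File>"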

If you now create a dataset in a local project from a file with something like

DataSets.create("new-ds-name", "local/file.csv")

it will generate a UUID for the dataset and simply copy the file to .datasets/<uuid>. This way we also avoid problems with, e.g., trying to infer destination file names and running into conflicts.
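Under the hood, such a create operation for a local managed repository could boil down to generating the UUID and copying the file into place. A rough sketch of the idea; the helper name is hypothetical and not part of the current API:

    using UUIDs

    # Hypothetical helper: copy a local file into the project's managed
    # storage and return the UUID plus the path that the new Data.toml
    # entry would record.
    function copy_into_managed_storage(project_dir, local_file)
        uuid = uuid4()
        dest = joinpath(project_dir, ".datasets", string(uuid))
        mkpath(dirname(dest))
        cp(local_file, dest)
        return uuid, dest
    end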

A few closing thoughts:

  • A data repo might not support managed datasets at all. That's fine; you just can't create/update/delete datasets there, only read existing ones. A repo may also have some datasets that are unmanaged even if it otherwise supports managed ones.
  • All "linked" datasets in a TOML file would be unmanaged, and hence read-only. It might even be worth implementing them via a separate storage driver, so as not to conflate them with the implementation for standard (managed) datasets. I'm not sure about an API for creating such a linked dataset -- it would probably have to be repository-specific, because such datasets only make sense for some repositories.
  • You might be able to convert linked datasets into managed ones, though, which would copy the data to the repository's storage (whatever that may be); a rough sketch of that operation follows below.
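Conversion would then be a one-way operation on the repository side: copy the linked data into the managed storage and rewrite the dataset's storage section. A hedged sketch of the idea (none of these names exist in DataSets today, and the dataset config is treated as a plain dictionary for simplicity):

    # Hypothetical: turn a linked (read-only) dataset into a managed one by
    # copying its current data under .datasets/<uuid> and repointing the path.
    function convert_to_managed!(project_dir, ds_config)
        uuid = ds_config["uuid"]
        old_path = ds_config["storage"]["path"]
        dest = joinpath(project_dir, ".datasets", uuid)
        mkpath(dirname(dest))
        cp(old_path, dest)   # works for both files and directories
        ds_config["storage"]["path"] = relpath(dest, project_dir)
        return ds_config
    end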

mortenpi · Nov 10 '22 01:11