DataSets.jl

CRUD updates, dataset mutation and BlobTree API updates

Open c42f opened this issue 3 years ago • 0 comments

This is a big batch of changes, implementing:

  • A big rewrite of the BlobTree API to make it simpler and more coherent
  • Opening a dataset for mutation with open(write=true)
  • A CRUD interface for modifying a DataProject
  • Changes to the data driver interface to support all of this

A lot of these changes are intertwined so I've put all this here as a draft, but I'll probably need to break this apart into separate PRs.

BlobTree

BlobTree now has a largely dictionary-like interface:

  • List keys (i.e., file and directory names): keys(tree)
  • List keys and values: pairs(tree)
  • Query keys: haskey(tree, path)
  • Traverse the tree: tree[path]
  • Add new content: newdir(tree, path), newfile(tree, path)
  • Delete content: delete!(tree, path)

Here path is either a RelPath (the relative path type) or an AbstractString, in which case it is split on / to become a relative path.

Unlike Dict, iterating a BlobTree currently yields values (not key-value pairs). This has some benefits: for example, an operation can be broadcast across the files in a directory.

  • Property access
    • isdir(), isfile() - determine whether a child of tree is a directory or file.
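As a rough illustration of the dictionary-like interface described above, here is a toy sketch in plain Julia. It is not the BlobTree implementation: ToyTree, toy_components, toy_newfile and is_toy_file are hypothetical names invented for this example, files are stored as in-memory strings, and only string paths are handled. It shows the two behaviours called out in the text: string paths split on / and iteration over values rather than pairs.

```julia
# Toy stand-in for a directory tree. NOT the DataSets.jl implementation.
# Each child is either a ToyTree (directory) or a String (file content).
struct ToyTree
    children::Dict{String,Any}
end
ToyTree() = ToyTree(Dict{String,Any}())

# Split "a/b/c" into ["a", "b", "c"], mirroring the AbstractString path rule.
toy_components(path::AbstractString) = String.(split(path, '/'))

# Names of the direct children, sorted for a stable order.
Base.keys(t::ToyTree) = sort!(collect(keys(t.children)))
Base.pairs(t::ToyTree) = [k => t.children[k] for k in keys(t)]
Base.length(t::ToyTree) = length(t.children)

# Query a (possibly nested) path.
function Base.haskey(t::ToyTree, path::AbstractString)
    node = t
    for name in toy_components(path)
        (node isa ToyTree && haskey(node.children, name)) || return false
        node = node.children[name]
    end
    return true
end

# Traverse the tree one path component at a time.
function Base.getindex(t::ToyTree, path::AbstractString)
    node = t
    for name in toy_components(path)
        node = node.children[name]
    end
    return node
end

# Add a file, creating intermediate directories as needed.
function toy_newfile(t::ToyTree, path::AbstractString, content::AbstractString)
    names = toy_components(path)
    node = t
    for name in names[1:end-1]
        node = get!(() -> ToyTree(), node.children, name)
    end
    node.children[names[end]] = String(content)
    return t
end

# Delete the entry at `path` from its parent directory.
function Base.delete!(t::ToyTree, path::AbstractString)
    names = toy_components(path)
    parent = length(names) == 1 ? t : t[join(names[1:end-1], '/')]
    delete!(parent.children, names[end])
    return t
end

# Iteration yields child values (not name => value pairs), unlike Dict.
function Base.iterate(t::ToyTree, i::Int=1)
    names = keys(t)
    return i > length(names) ? nothing : (t.children[names[i]], i + 1)
end

is_toy_file(x) = x isa String

tree = ToyTree()
toy_newfile(tree, "1/a.txt", "Content of a")
toy_newfile(tree, "b-1.txt", "Content of b")

keys(tree)                   # names of the direct children
haskey(tree, "1/a.txt")      # nested lookup via a string path
tree["1/a.txt"]              # traverse to the file content
is_toy_file.(tree)           # broadcasting works because iteration yields values
delete!(tree, "b-1.txt")
```

Because iteration yields values, the broadcast call applies directly to each child; with Dict-style iteration the function would receive name => value pairs instead.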

Example

You can create a new temporary BlobTree with the newdir() function and fill it using any combination of newfile() and newdir():

julia> dir = newdir()
       for i = 1:3
           newfile(dir, "$i/a.txt") do io
               println(io, "Content of a")
           end
           newfile(dir, "b-$i.txt") do io
               println(io, "Content of b")
           end
       end
       dir
📂 Tree  @ /tmp/jl_Sp6wMF
 📁 1
 📁 2
 📁 3
 📄 b-1.txt
 📄 b-2.txt
 📄 b-3.txt

You can also get access to a BlobTree by using DataSets.from_path() with a local directory name. For example:

julia> using DataSets
       open(DataSets.from_path(joinpath(pkgdir(DataSets), "src")))
📂 Tree  @ ~/.julia/dev/DataSets/src
 📄 DataSet.jl
 📄 DataSets.jl
 📄 DataTomlStorage.jl
 ...

AbstractDataProject interface additions

To support CRUD of datasets (#31) within data projects, the data driver interface needs much more flexibility. I've added:

  • DataSets.create() to create datasets — still needs some refinement, in particular the keyword parameters.
  • Base.setindex!() to add a dataset to a project
  • DataSets.delete() to delete datasets
  • Implementations for StackedDataProject, AbstractTOMLDataProject and TOMLDataProject

Relatedly, I've added DataSets.from_path() to create a standalone DataSet from data on the local filesystem, inferring the type as Blob or BlobTree. This can be passed as a source to create() to make a copy.
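To make the "inferring the type" step concrete, here is a minimal sketch of how such an inference could work using only Base filesystem checks. The helper name toy_infer_kind and its Symbol return values are hypothetical; this is not the code behind DataSets.from_path(), only one plausible way to choose between Blob and BlobTree.

```julia
# Hypothetical helper: decide which dataset type a local path would map to.
# Directories map to :BlobTree, regular files to :Blob.
function toy_infer_kind(path::AbstractString)
    if isdir(path)
        return :BlobTree
    elseif isfile(path)
        return :Blob
    else
        throw(ArgumentError("no file or directory at $path"))
    end
end

# Try it on a temporary directory containing one file.
dir = mktempdir()
file = joinpath(dir, "a.txt")
write(file, "Content of a")

kinds = (toy_infer_kind(dir), toy_infer_kind(file))
```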

Still TODO here is DataSets.config (or some such) to update the metadata of a DataSet. An alternative would be to have the dataset know its owning data project and call back into that project when the dataset is updated.

Low level AbstractDataDriver interface

The low level driver interface is currently (in 0.2.6) just a function taking a user-defined callback.

However, to support CRUD operations for DataProject it needs to be expanded quite a bit, in particular to be able to create and delete storage in the storage backend. This PR adds AbstractDataDriver and, so far, a single implementation, FileSystemDriver, with implementations of:

  • open_dataset to do what the current function-based API does
  • close_dataset to clean up any dataset resources, also indicating whether the close happened due to an exception.
  • create_storage to initialize storage
  • delete_storage to remove storage

This interface is probably still a bit half-baked and needs some refinement.
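For readers who want a feel for the shape of such an interface, here is a self-contained toy sketch. Every name and signature in it (ToyDataDriver, ToyFileSystemDriver, the toy_ functions, the handle layout, the :committed and :aborted return values) is an assumption made up for illustration and does not reflect the actual signatures in this PR. It shows how the four operations could fit together, and how the older callback style can be layered on top of open and close.

```julia
# Hypothetical sketch; all names and signatures here are assumptions.
abstract type ToyDataDriver end

struct ToyFileSystemDriver <: ToyDataDriver end

# Initialize storage: here, an empty directory. Returns the storage path.
toy_create_storage(::ToyFileSystemDriver, path::AbstractString) = mkpath(path)

# Remove the storage and everything inside it.
toy_delete_storage(::ToyFileSystemDriver, path::AbstractString) =
    rm(path; recursive=true, force=true)

# Open returns a handle; for a directory the path plus a write flag suffices.
toy_open_dataset(::ToyFileSystemDriver, path::AbstractString; write::Bool=false) =
    (path=path, write=write)

# Close is told whether an exception occurred while the dataset was open.
function toy_close_dataset(::ToyFileSystemDriver, handle, exc=nothing)
    return exc === nothing ? :committed : :aborted
end

# The callback style can be recovered from the open/close pair.
function toy_with_dataset(f, driver::ToyDataDriver, path; write::Bool=false)
    handle = toy_open_dataset(driver, path; write=write)
    try
        result = f(handle)
        toy_close_dataset(driver, handle)
        return result
    catch exc
        # Report the failure to the driver, then propagate it.
        toy_close_dataset(driver, handle, exc)
        rethrow()
    end
end

driver = ToyFileSystemDriver()
root = mktempdir()
store = toy_create_storage(driver, joinpath(root, "my_dataset"))

# Write into the storage, then read the content back.
toy_with_dataset(driver, store; write=true) do h
    write(joinpath(h.path, "a.txt"), "Content of a")
end
content = toy_with_dataset(h -> read(joinpath(h.path, "a.txt"), String), driver, store)

toy_delete_storage(driver, store)
```

Passing the exception to the close step lets a driver decide whether to commit or discard pending changes, which a single callback-taking function cannot express as directly.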

c42f, Apr 26 '22 11:04